Cromwell-workflow-management-system
Cromwell is a workflow management system geared towards scientific workflows. More information can be found online.
It can function as a workflow-execution engine that parses jobs written in the Workflow Description Language (WDL) and executes them on a range of different backends. This makes workflows written in WDL easier to share and migrate, and opens up the use of pipelines developed elsewhere. In addition, Cromwell remembers the inputs and outputs of each subtask and re-runs only the tasks it needs to, reusing the output of previously completed tasks. This can be of great benefit when your 30-step analysis pipeline crashes at step 29!
HPC Cromwell-as-a-service (CaaS)
The HPC team has set up a Cromwell-as-a-service. The (CaaS) is served from a dedicated server running a separate Cromwell-service instance per user. A user can then post jobs to the URL of their personal instance. A posted job (or workflow) will then be processed on the HPC in the user's name.
A user will need to explicitly request access to the service, since it would be quite a waste to run an instance for each user by default; not everyone will use the service. The (CaaS) is secured by Basic Authentication with a custom username/password. When posting a job to the (CaaS) you need to submit these credentials.
Before you request usage of the (CaaS) we kindly ask you to quickly scan the Cromwell documentation. We especially recommend the very short 5-minute introduction to Cromwell to get a feel for what Cromwell can do. Please note that the introduction refers to running Cromwell in Run mode, whereas the (CaaS) runs Cromwell in Server mode. Server mode has a slightly shorter introduction. Because the (CaaS) is about running Cromwell as a service with a REST interface, reading about the Cromwell Server REST API is crucial.
Apply for the service
If you want to request a (personal) Cromwell service, send an e-mail to hpc-systems@lists.umcutrecht.nl with:
- Your HPC username (e.g. johndoe)
- A secure way to share the password(s) with you (e.g. a mobile-phone number)
- An existing workflow-execution directory (e.g. /hpc/<institute or group>/johndoe/cromwell-executions), accessible from an HPC compute node and writable by the user
- An existing workflow-log directory (e.g. /hpc/<group>/johndoe/cromwell-logs), accessible from an HPC compute node and writable by the user
- The version of Cromwell the (CaaS) needs to run for you (default: the latest configured)
- The username of a read-only account; a secondary user account for the (CaaS) instance that can only do GETs but cannot POST new jobs to the REST service (default: none)
Example: job submission
A workflow can be submitted to the RESTful API endpoints of the Cromwell service. These endpoints are prefixed by the URL of a Cromwell service (e.g. johndoe.hpccw.op.umcutrecht.nl). As is the nature of RESTful APIs, this can be done in many different ways. One such way is using the widely available command-line tool cURL.
Below is an example where hello.wdl is the workflow, name.json is the input for the workflow, and johndoe is the user:
$ curl -X POST "https://johndoe.hpccw.op.umcutrecht.nl/api/workflows/v1" \
--user johndoe:password1 \
--header "accept: application/json" \
--header "Content-Type: multipart/form-data" \
--form "workflowSource=@hello.wdl" \
--form "workflowInputs=@name.json"
This should then output:
{
"id":"3415ad29-ecc0-4a9d-93e2-660d0a95945d",
"status":"Submitted"
}
This tells us that the job has been successfully submitted to Cromwell and has been given the id 3415ad29-ecc0-4a9d-93e2-660d0a95945d.
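Every later call (status, abort, and so on) needs this id, so it is worth extracting it programmatically. A minimal Python sketch, using the example response above; in a real script the string would be the body of the HTTP response:

```python
import json

# Example submission response as returned by Cromwell (taken from above).
response_body = '{"id":"3415ad29-ecc0-4a9d-93e2-660d0a95945d","status":"Submitted"}'

submission = json.loads(response_body)
workflow_id = submission["id"]

print(workflow_id)  # the id to use in later status/abort calls
```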
Cromwell will pick up the job and start to execute each of the tasks in the workflow on a submit node of the HPC.
Keeping tabs on the progress of the workflow can be done via the RESTful endpoints status / timing / metadata in combination with the id of the job. For instance, visiting https://johndoe.hpccw.op.umcutrecht.nl/api/workflows/v1/3415ad29-ecc0-4a9d-93e2-660d0a95945d/status in the browser will output:
{
"status": "Succeeded",
"id": "3415ad29-ecc0-4a9d-93e2-660d0a95945d"
}
This tells you that the workflow succeeded, i.e. it finished successfully.
Aborting a running workflow is also facilitated by the RESTful endpoint abort in combination with the id of the job. For instance, running curl -X POST --user johndoe:password1 https://johndoe.hpccw.op.umcutrecht.nl/api/workflows/v1/3415ad29-ecc0-4a9d-93e2-660d0a95945d/abort in a terminal will output:
{
"status": "Aborted",
"id": "3415ad29-ecc0-4a9d-93e2-660d0a95945d"
}
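All of these per-workflow endpoints share one URL scheme, which makes them easy to build from the service URL and the job id. A small sketch; the endpoint_url helper is ours, not part of Cromwell:

```python
# Hypothetical helper (not part of Cromwell) that builds the per-workflow
# endpoint URLs used on this page.
def endpoint_url(base, workflow_id, action):
    """Build a Cromwell v1 workflow endpoint URL ('status', 'timing', 'metadata', 'abort')."""
    return f"{base}/api/workflows/v1/{workflow_id}/{action}"

base = "https://johndoe.hpccw.op.umcutrecht.nl"
wf_id = "3415ad29-ecc0-4a9d-93e2-660d0a95945d"
print(endpoint_url(base, wf_id, "abort"))
```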
Doing every submission via cURL can be bothersome. Nearly all popular programming languages have packages or support for calling RESTful APIs (e.g. Python: requests, R: RCurl). However, there are several Cromwell-specific alternatives.
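As a sketch of such a call from Python, here is a status query using only the standard library (requests would make it shorter); the helper names are ours, and the URL and credentials are the placeholders used throughout this page:

```python
import base64
import json
import urllib.request

def basic_auth_header(user, password):
    # Build the Basic Authentication header value the (CaaS) expects.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"

def get_status(base, workflow_id, user, password):
    # GET the status endpoint and return the decoded JSON document.
    url = f"{base}/api/workflows/v1/{workflow_id}/status"
    req = urllib.request.Request(url, headers={
        "Authorization": basic_auth_header(user, password),
        "accept": "application/json",
    })
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(basic_auth_header("johndoe", "password1"))
```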
Swagger UI
Simply visiting https://johndoe.hpccw.op.umcutrecht.nl in a browser will provide you with a Swagger user interface that can aid in building and testing cURL-like calls to the Cromwell service.
The Broad Institute has also developed a Python package, cromwell-tools, to interact with Cromwell. This package can also be used via the command-line interface.
$ cromwell-tools submit --username johndoe --password password1 --url https://johndoe.hpccw.op.umcutrecht.nl -w hello.wdl -i name.json
Technical details
Caching heuristics
Having each Cromwell service calculate an md5 checksum over multiple files puts a considerable load on the input/output throughput of the HPC storage. As such, Cromwell is configured to use "path+modtime" instead of "file" as its caching heuristic. More information can be found in the Cromwell documentation. In short: do not modify any input files and then manually reset the modtime on those files. Silly you if you do.
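To illustrate the pitfall, here is a toy model of a "path+modtime" cache key (a deliberate simplification, not Cromwell's actual implementation): modifying a file and then resetting its modtime produces the same key, i.e. a stale cache hit.

```python
import os
import tempfile

def cache_key(path):
    # Simplified model of the "path+modtime" heuristic: the key is the file's
    # path plus its modification time, NOT a checksum of its content.
    st = os.stat(path)
    return (os.path.abspath(path), st.st_mtime_ns)

# Write a file, record its cache key ...
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("original input")
    path = f.name

st = os.stat(path)
before = cache_key(path)

# ... then change its content but manually reset the modtime.
with open(path, "w") as f:
    f.write("MODIFIED input")
os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))

after = cache_key(path)
print(before == after)  # True: a stale cache hit, despite the changed content
os.unlink(path)
```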
Outside access to the (CaaS)
The URL of the Cromwell service (i.e. https://johndoe.hpccw.op.umcutrecht.nl) can only be accessed from within permitted domains. The UMC domain is permitted by default. It is possible to permit access from a different domain; please contact hpc-systems@lists.umcutrecht.nl and provide a source IP address.
Configuration
To reduce the administrative load on the HPC team, only a single configuration exists for all Cromwell services of a specific version. These configurations are provided by the Kemmeren group at the Princess Máxima Center and are the result of years of experience with Cromwell. The configurations are hosted on Bitbucket.
WDL version
At the time of writing, WDLs submitted to the Cromwell service are by default expected to be in draft-2 of the WDL specification. You can override this via workflow options.
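For reference, WDL 1.0 files declare their version explicitly at the top of the file, whereas files without a version statement are treated as draft-2. A sketch of the hello example from this page in 1.0 syntax (whether a version statement alone suffices depends on the Cromwell version the (CaaS) runs for you):

```wdl
version 1.0

task hello {
  input {
    String name
  }
  command {
    echo 'Hello ~{name}!'
  }
  output {
    File response = stdout()
  }
}

workflow test {
  call hello
}
```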
How-tos
Backup a (CaaS) database
By default your (CaaS) database with the workflow cache and metadata is NOT backed up. It is possible to set this up yourself; access to the MySQL database is provided.
- Login to one of the HPC transfer nodes
$ ssh johndoe@hpct01.op.umcutrecht.nl
- Create a directory to hold MySQL configurations and securely set the permissions
$ mkdir ~/.mysql && chmod 0700 ~/.mysql
- Create a file ~/.mysql/backup-caas-johndoe.cnf with the following content (note the slightly abnormal host!):
[mysqldump]
user=johndoe
password=password1
host=johndoe.hpccw02.compute.hpc
- Test-run creating a backup using mysqldump (note: $HOME is used because the shell does not tilde-expand ~ after an =; the \% escapes are only needed inside a crontab entry, so they are dropped here)
$ module load mysql; mysqldump --defaults-extra-file=$HOME/.mysql/backup-caas-johndoe.cnf johndoe | gzip -9 > johndoe-backup_`date "+%F_%H%M"`.sql.gz
- Create a cronjob (i.e. a command that is executed periodically) to do this backup for you. Note that you might want to specify a target directory. For instance, to run it every Sunday at 5am (cron treats % specially, hence the \% escapes):
$ command='module load mysql; mysqldump --defaults-extra-file=$HOME/.mysql/backup-caas-johndoe.cnf johndoe | gzip -9 > /data/isi/g/group/johndoe-backup_`date "+\%F_\%H\%M"`.sql.gz';
$ (crontab -l ; echo "0 5 * * SUN bash -l -c '$command'") | crontab -
Restore a (CaaS) database
Restoring a (gzipped) database dump can be done as follows:
- Login to one of the HPC transfer nodes
$ ssh johndoe@hpct01.op.umcutrecht.nl
- Unpack the backup while redirecting it to the mysql client
$ gunzip < johndoe-backup.sql.gz | mysql --user=johndoe --password --host=johndoe.hpccw02.compute.hpc johndoe
Connect to the (CaaS) via the HPC gateway
The permitted-domain security of the (CaaS) can be circumvented in a slightly convoluted way by using an SSH proxy jump via the HPC gateway combined with a port forward:
$ ssh -L 4242:johndoe.hpccw.op.umcutrecht.nl:443 \
-l johndoe \
-o ProxyCommand='ssh -q %r@hpcgw.op.umcutrecht.nl -W %h:%p' \
hpcsubmit.op.umcutrecht.nl
If you then communicate with a RESTful endpoint you NEED to use the URL https://localhost:4242 (note the https) instead of https://johndoe.hpccw.op.umcutrecht.nl, set the Host header of the HTTP request, and allow insecure connections:
$ curl https://localhost:4242/api/workflows/v1/backends \
--insecure \
--user johndoe:password1 \
--header "Host: johndoe.hpccw.op.umcutrecht.nl"
WDL examples
Below are the contents of the files used in the examples above. The full WDL specification can be found online.
file: hello.wdl
task hello {
  String name
  command {
    echo 'Hello ${name}!'
  }
  output {
    File response = stdout()
  }
}

workflow test {
  call hello
}
file: name.json
{
  "test.hello.name": "World"
}