How to run Snakemake on ScienceCluster¶
This guide describes how to configure Snakemake so that it submits each job automatically to the cluster queue. The Snakemake documentation provides basic guidelines for cluster execution. The approach based on cluster config has been deprecated in favour of resources and profiles. Although a slurm profile is available, it has certain limitations, in particular a low job submission rate, which can make it impractical for large pipelines. There is also a simpler implementation that avoids the job submission rate problem. Below, we provide an even simpler implementation. Even though it does not provide the same functionality as the official profile, it has no known limitations. It is loosely (but not entirely) based on the blog post by Sichong Peng.
The guide assumes that you are familiar with Snakemake and have already installed it into your user space. If you have not installed it yet, the easiest way is to use the `mamba` or `anaconda3` module and to install the `snakemake-minimal` or `snakemake` package into a new environment. In most cases, `snakemake-minimal` is sufficient, especially if you are relatively new to Snakemake. Mamba is a package manager that is fully compatible with `conda` but performs better on certain tasks, such as resolving dependencies. For these reasons, we will use `mamba` in this example. To load the module, run `module load mamba`.
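If you are unsure which versions of the module are installed on the cluster, you can list them first. The commands below are a minimal sketch using the standard environment modules interface.

# list the available versions of the mamba module, then load the default one
module avail mamba
module load mamba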
Missing
ScienceCluster does not support DRMAA.
Conda environment¶
We will use the following minimal conda environment. You may want to update to the newest versions of python and snakemake.
name: snakemake_cluster
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python=3.10.5
- snakemake-minimal=7.8.2
You can easily recreate it with `mamba` by placing the definition in a yml file, e.g. `snakemake_cluster.yml`, and running `mamba env create -f snakemake_cluster.yml`.
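Once the environment has been created, you can activate it and confirm that Snakemake is available. This is only a quick sanity check; the environment name matches the one defined in the yml file above.

# initialise conda in the current shell and activate the new environment
module load mamba
eval "$(conda shell.bash hook)"
conda activate snakemake_cluster

# confirm that Snakemake is installed
snakemake --version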
Cluster profile config¶
The profile configuration file defines the command used to submit jobs as well as the default resource specifications for jobs. We will store it in `env/slurm/config.yaml`. You can create the directory structure with `mkdir -p env/slurm`, where `slurm` is the profile's name. Now, you can create the file `env/slurm/config.yaml` and place the cluster parameters there. The advantage of this approach, as opposed to storing the profile in the default location (see the Snakemake manual for details), is that the slurm profile will be added to your git repository.
jobs: 100
cluster: >-
sbatch
--ntasks 1
--cpus-per-task {resources.threads}
--mem {resources.mem_mb}
--time {resources.time}
-o log/jobs/{rule}_{wildcards}_%j.out
-e log/jobs/{rule}_{wildcards}_%j.err
default-resources:
- threads=1
- mem_mb=3850
- time="1:00:00"
The first parameter, `jobs`, defines the maximum number of jobs Snakemake will keep running and pending simultaneously. The number of jobs that a user can submit depends on the qos (Quality of Service), which in turn depends on the job's requested runtime. You can see the current limits by running `sacctmgr show qos format=name,priority,maxwall,maxsubmit,maxtrespu%30,maxjobspu`. At the time of writing, the following limits are enforced (subject to change).
| Name     | Priority | MaxWall     | MaxSubmit | MaxTRESPU                  | MaxJobsPU |
|----------|----------|-------------|-----------|----------------------------|-----------|
| normal   | 0        | 1-00:00:00  | 10000     |                            |           |
| medium   | 0        | 2-00:00:00  | 5000      |                            |           |
| long     | 0        | 7-00:00:00  | 500       | cpu=1024,gres/gpu=24       | 24        |
| verylong | 0        | 28-00:00:00 | 10        | cpu=64,gres/gpu=4          | 1         |
| debug    | 50000    | 00:04:00    | 1         | cpu=32,gres/gpu=2,mem=128G | 1         |
`Priority` is the additional priority a job gets. `MaxWall` shows the maximum runtime; `MaxSubmit` is the number of jobs you can submit at a time. `MaxTRESPU` is the highest amount of special resources a user can reserve. `MaxJobsPU` is the maximum number of jobs that can run in parallel. An empty value in the `MaxJobsPU` column means that the value is the same as for `MaxSubmit`.
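To check how much of your submission limit is currently in use, you can count your queued jobs. This is just an illustrative one-liner; `-h` suppresses the header so that only job lines are counted.

# count your running and pending jobs
squeue -u $USER -h | wc -l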
The second parameter, `cluster`, specifies the command to be used for job submission. The only change you may need here is the location of the output and error files. In the definition above, the file names consist of the rule name, the wildcards, and slurm's job id. You have to create the log directories in advance, or the jobs will fail. In particular, you will need to run `mkdir -p log/jobs` in the same directory where your `Snakefile` resides.
Caution
In the `cluster` parameter, `>-` is used for line wrapping. The `>` character folds the line breaks inside the block into spaces, and the `-` character strips the line break at the very end. Spacing is crucial here: if a line has a different indentation than the previous line, a line break will be inserted into the command and job submission will fail.
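If you want to verify that the folding produced a single-line command, you can print the parsed value, for instance with PyYAML (installed as a Snakemake dependency). This is only a quick check, not part of the workflow itself.

# print the folded "cluster" value; it should come out as a single line
python -c "import yaml; print(yaml.safe_load(open('env/slurm/config.yaml'))['cluster'])"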
The last parameter, `default-resources`, specifies the default resources a job will be allocated if the respective values are not defined in the rules within the `Snakefile`. The values are passed directly to slurm, so they should have the same format as the corresponding slurm parameters. For example, `time` can be specified as `24:00:00` (24 hours) or as `1-00:00:00` (1 day).
Snakefile¶
For illustrative purposes, the `Snakefile` contains two rules. The first runs a couple of bash commands and then sleeps for a random number of seconds between 1 and 100. The second counts the number of characters in the output file that the first rule generates. In addition, both rules print the information that slurm has about the submitted jobs. This will help us determine whether the resources were allocated correctly.
rule all:
input:
expand("data/small_job_{iteration}.txt", iteration=[1,2,3])
rule big_job:
output:
"data/big_job_{iteration}.txt"
resources:
mem_mb=7700,
time="2:00:00"
shell:
"""
date '+%Y-%m-%d %H:%M:%S' > {output}
hostname >> {output}
echo "Host has $(ps -e | wc -l) processes running" >> {output}
scontrol show jobid=$SLURM_JOBID >> {output}
delay=$((1 + $RANDOM % 100))
echo "Will sleep for $delay seconds" >> {output}
sleep $delay
date '+%Y-%m-%d %H:%M:%S' >> {output}
"""
rule small_job:
input:
"data/big_job_{iteration}.txt"
output:
"data/small_job_{iteration}.txt"
shell:
"""
date '+%Y-%m-%d %H:%M:%S' > {output}
hostname >> {output}
wc -c {input} >> {output}
scontrol show jobid=$SLURM_JOBID >> {output}
date '+%Y-%m-%d %H:%M:%S' >> {output}
"""
Since the `big_job` rule is expected to run longer and require more memory, we override the default `mem_mb` and `time` parameters in the `resources` section of that rule. All other resource-related parameters will be taken from the profile that we defined previously. Conversely, since the `small_job` rule lacks a `resources` section, Snakemake will take all of its parameters from the profile.
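If a rule needs more than one CPU core, you can request it in the same way. The hypothetical rule below is only a sketch; note that the profile defined above reads the core count from `resources.threads`, so the value has to be set there for `--cpus-per-task` to pick it up.

rule multithreaded_job:
    output:
        "data/multithreaded_job.txt"
    resources:
        threads=4,          # mapped to --cpus-per-task by the profile
        mem_mb=15400,
        time="4:00:00"
    shell:
        """
        echo "Running on $SLURM_CPUS_PER_TASK CPUs" > {output}
        """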
Batch script¶
To run the pipeline on the cluster, we are going to create a script called `run.slurm` with the following content.
#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850
#SBATCH --time=1-00:00:00
#SBATCH --output=log/main_%j
module load mamba/4.14.0
eval "$(conda shell.bash hook)"
conda activate snakemake_cluster
snakemake --profile ./env/slurm "$@"
The first line indicates that the script should be processed with `bash` when run. After that, there are several slurm parameters for the batch script: it will use 1 CPU and 3850 MB of RAM for at most one day. The output will be saved to the file named `log/main_%j`, where `%j` is the slurm job id that slurm fills in automatically.
Next, we have three lines that initialise conda and activate the environment that we created for this project. At the moment, it is more convenient to use `conda` for activation; the command is included with `mamba`.
Finally, there is the `snakemake` call. The first parameter specifies the profile location relative to the directory containing the `Snakefile`. The last argument to the `snakemake` command is `"$@"`, which expands to all arguments passed to the script. This is useful if you need to specify additional parameters but do not want to include them in the script, for example `sbatch run.slurm --keep-going --rerun-incomplete`.
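Before submitting, it can be useful to do a dry run on the login node to confirm that the workflow and the profile parse correctly; the `-n` flag only shows what would be executed and does not submit any jobs.

# preview the jobs without submitting anything
module load mamba
eval "$(conda shell.bash hook)"
conda activate snakemake_cluster
snakemake -n --profile ./env/slurm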
Before running the pipeline, we need to create the `log/jobs` directory, where slurm will write the output, as well as the `data` directory for the output files.
mkdir -p log/jobs data
Now, we can submit the whole pipeline with `sbatch run.slurm`. You can see the status by running `squeue -u $USER`. The pipeline should finish fairly quickly, within a couple of minutes; however, a typical pipeline may take days to complete.
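While the pipeline is running, you can follow the main job's log and, once jobs have finished, inspect their accounting information. The commands below are generic examples; replace `<jobid>` with the id printed by `sbatch`.

# follow the log written by the main Snakemake job
tail -f log/main_<jobid>

# show the state, runtime and memory usage of your recent jobs
sacct -u $USER --format=JobID,JobName%20,State,Elapsed,MaxRSS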