How to run Snakemake on ScienceCluster

This guide describes how to configure Snakemake so that it submits each job automatically to the cluster queue. The Snakemake documentation provides basic guidelines for cluster execution. The approach based on cluster config has been deprecated in favour of resources and profiles. Although a slurm profile is available, it has certain limitations, in particular a low job submission rate, which can make it impractical for large pipelines. There is also a simpler implementation that does not suffer from the job submission rate problem. Below we provide an even simpler implementation; it does not offer all the functionality of the official profile, but it has no known limitations. It is loosely (but not entirely) based on the (archived) blog post by Sichong Peng.

The guide assumes that you are familiar with Snakemake and have already installed it into your user space. If you have not installed it yet, the easiest way is to use the mamba or anaconda3 module and install the snakemake-minimal or snakemake package into a new environment. In most cases, snakemake-minimal is sufficient, especially if you are relatively new to Snakemake.

Mamba is a package manager that is fully compatible with conda but performs better on certain tasks, such as dependency resolution. For that reason, we will use mamba in this example. To load the module, run module load mamba.

Missing

ScienceCluster does not support DRMAA.

Conda environment

We will use the following minimal conda environment. You may want to update to the newest versions of python and snakemake.

name: snakemake_cluster
channels:
   - conda-forge
   - bioconda
   - defaults
dependencies:
   - python
   - snakemake-minimal

You can easily recreate it with mamba by placing the definition in a yml file, e.g. snakemake_cluster.yml, and running mamba env create -f snakemake_cluster.yml.
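
For example, assuming the definition above is saved as snakemake_cluster.yml in the current directory, the environment can be created and activated as follows (the shell hook line mirrors the one used in the batch script later in this guide):

module load mamba
mamba env create -f snakemake_cluster.yml

eval "$(conda shell.bash hook)"
conda activate snakemake_cluster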

Cluster profile config

The profile configuration file defines the command used to submit jobs as well as the default resource specifications for jobs. We will store it in env/slurm/config.yaml. You can create the directory structure with mkdir -p env/slurm, where slurm is the profile's name. Now you can create the file env/slurm/config.yaml and place the cluster parameters there. The advantage of this approach, as opposed to storing the profile in the default location (see the Snakemake manual for details), is that the slurm profile can be committed to your git repository together with the rest of the workflow.
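
For example, the following commands create the profile directory and an empty config file; the commented sketch shows where the other directories used in this guide will sit relative to the Snakefile (log/jobs and data are created later):

mkdir -p env/slurm
touch env/slurm/config.yaml

# Project layout used throughout this guide:
#   Snakefile
#   env/slurm/config.yaml
#   log/jobs/
#   data/

The parameters that go into config.yaml are shown below.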

jobs: 100
cluster: >-
    sbatch
    --ntasks 1
    --cpus-per-task {resources.threads}
    --mem {resources.mem_mb}
    --time {resources.time}
    -o log/jobs/{rule}_{wildcards}_%j.out
    -e log/jobs/{rule}_{wildcards}_%j.err
default-resources:
    - threads=1
    - mem_mb=3850
    - time="1:00:00"

The first parameter, jobs, defines the maximum number of jobs Snakemake will keep running or pending simultaneously. The number of jobs that a user can submit depends on the qos (Quality of Service), which in turn depends on the job's requested runtime. You can see the current limits by running sacctmgr show qos format=name,priority,maxwall,maxsubmit,maxtrespu%30,maxjobspu. At the time of writing, the following limits are enforced (subject to change).

Name      Priority  MaxWall      MaxSubmit  MaxTRESPU                    MaxJobsPU
normal    0         1-00:00:00   10000
medium    0         2-00:00:00   5000
long      0         7-00:00:00   500        cpu=1024,gres/gpu=24         24
verylong  0         28-00:00:00  10         cpu=64,gres/gpu=4            1
debug     50000     00:04:00     1          cpu=32,gres/gpu=2,mem=128G   1

Priority is the additional priority a job receives. MaxWall is the maximum runtime; MaxSubmit is the maximum number of jobs you can have submitted at a time. MaxTRESPU is the highest amount of special resources a user can reserve. MaxJobsPU is the maximum number of jobs that can run in parallel. An empty value in the MaxJobsPU column means that the value is the same as MaxSubmit.
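
If you want to check which qos your pending or running jobs were assigned, squeue can display it; the format string below is just one possible example:

squeue -u $USER -o "%i %j %q %M %L"
# %i = job id, %j = job name, %q = qos, %M = time used, %L = time left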

The second parameter, cluster, specifies the command to be used for job submission. The only change you may possibly need here is the location of the output and error files. In the definition above, the file names consist of the rule name, wildcards and slurm's job id. You have to create the log directories in advance or the jobs will fail. In particular, you will need to run mkdir -p log/jobs in the same directory where your Snakefile resides.

Caution

In the cluster parameter, >- is YAML's folded block scalar syntax and is used for line wrapping. The > character folds the value, replacing each line break with a space so that the command becomes a single line, and the - character strips the line break at the very end. Spacing is crucial here. If a line is indented differently from the previous line, a literal line break ends up in the command and job submission will fail.
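
One quick way to verify that the folded value really is a single line is to load the config with Python's yaml module and print the cluster entry (PyYAML is available in the activated environment because Snakemake depends on it):

python -c 'import yaml; print(yaml.safe_load(open("env/slurm/config.yaml"))["cluster"])'

The printed command should appear on one line; any extra line breaks point to an indentation problem.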

The last parameter, default-resources, specifies the default resources a job will be allocated if the respective values are not defined in the rules of the Snakefile. The values are passed directly to slurm, so they should have the same format as the corresponding slurm parameters. For example, time can be specified as 24:00:00 (24 hours) or as 1-00:00:00 (1 day).

Snakefile

For illustrative purposes, the Snakefile contains two rules. The first runs a couple of bash commands and then sleeps for a random number of seconds between 1 and 100. The second counts the number of characters in the output file that the first rule generates. In addition, both rules print the information that slurm has about the submitted job. This will help us determine whether the resources were allocated correctly.

rule all:
   input:
      expand("data/small_job_{iteration}.txt", iteration=[1,2,3])

rule big_job:
   output:
      "data/big_job_{iteration}.txt"
   resources:
      mem_mb=7700,
      time="2:00:00"
   shell:
      """
      date '+%Y-%m-%d %H:%M:%S' > {output}
      hostname >> {output}
      echo "Host has $(ps -e | wc -l) processes running" >> {output}
      scontrol show jobid=$SLURM_JOBID >> {output}
      delay=$((1 + $RANDOM % 100))
      echo "Will sleep for $delay seconds" >> {output}
      sleep $delay
      date '+%Y-%m-%d %H:%M:%S' >> {output}
      """

rule small_job:
   input:
      "data/big_job_{iteration}.txt"
   output:
      "data/small_job_{iteration}.txt"
   shell:
      """
      date '+%Y-%m-%d %H:%M:%S' > {output}
      hostname >> {output}
      wc -c {input} >> {output}
      scontrol show jobid=$SLURM_JOBID >> {output}
      date '+%Y-%m-%d %H:%M:%S' >> {output}
      """

Since the big_job rule is expected to run longer and require more memory, we override the default mem_mb and time parameters in the resources section of that rule. All other resource-related parameters are taken from the profile that we defined previously. Conversely, since the small_job rule has no resources section, Snakemake takes all of its parameters from the profile.
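
Before submitting anything, you can preview the jobs that Snakemake would create, together with the resources assigned to each of them, by performing a dry run from the directory containing the Snakefile (nothing is submitted to slurm at this point):

module load mamba
eval "$(conda shell.bash hook)"
conda activate snakemake_cluster
snakemake -n --profile ./env/slurm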

Batch script

For running the pipeline on the cluster, we are going to create a script called run.slurm with the following content.

#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850
#SBATCH --time=1-00:00:00
#SBATCH --output=log/main_%j

module load mamba/4.14.0
eval "$(conda shell.bash hook)"
conda activate snakemake_cluster

snakemake --profile ./env/slurm "$@"

The first line indicates that the script should be processed with bash when run. After that, there are several slurm parameters for the batch script. It will use 1 CPU and 3850 MB of RAM for at most one day. The output will be saved to a file named log/main_%j, where %j is the slurm job id that slurm fills in automatically.

Next, we have three lines that initialise conda and activate the environment that we created for this project. At the moment, it is more convenient to use conda for activation; the conda command is included with the mamba installation.

Finally, there is the snakemake call. The --profile parameter points to the profile directory, given relative to the location of the Snakefile. The last argument to the snakemake command is "$@", which expands to all arguments passed to the script. This is useful if you need to specify additional parameters but do not want to include them in the script. For example, sbatch run.slurm --keepgoing --rerun-incomplete.

Before running the pipeline, we need to create the log/jobs directory, where slurm will write the job output, as well as the data directory for the output files.

mkdir -p log/jobs data

Now we can submit the whole pipeline with sbatch run.slurm. You can check the status by running squeue -u $USER. The pipeline should finish fairly quickly, within a couple of minutes; a typical pipeline, however, may take days to complete.
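
For reference, a short session using the file names from this guide might look like this (the exact output file you inspect is just an example):

sbatch run.slurm                                  # submit the whole pipeline
sbatch run.slurm --keepgoing --rerun-incomplete   # alternatively, forward extra options via "$@"
squeue -u $USER                                   # watch the main job and the jobs Snakemake submits
ls log/jobs/                                      # per-job slurm logs
cat data/small_job_1.txt                          # output produced by the small_job rule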