
How to run Snakemake on the ScienceCluster

This guide describes how to configure Snakemake so that it submits each job automatically to the cluster queue. The Snakemake documentation provides basic guidelines for cluster execution; here, we extend those guidelines into a complete recipe for running Snakemake on the ScienceCluster.

The guide assumes that you are familiar with Snakemake and have already installed it into your user space. If you have not installed it yet, the easiest way is to use the anaconda3 module and install the snakemake-minimal or snakemake package into a new environment. In most cases, snakemake-minimal is sufficient, especially if you are relatively new to Snakemake.

Please note that the ScienceCluster does not support DRMAA.

Environment

We will use the following minimal environment.

name: snakemake_cluster
channels:
   - conda-forge
   - bioconda
   - defaults
dependencies:
   - python=3.9.6
   - snakemake-minimal=6.6.1

You can easily recreate it with conda by placing the definition in a yml file, e.g. snakemake_cluster.yml, and running conda env create -f snakemake_cluster.yml.
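
On the cluster, this usually amounts to loading the anaconda3 module mentioned earlier and creating the environment from that file. A minimal sketch, assuming the definition was saved as snakemake_cluster.yml:

# load conda provided by the anaconda3 module
module load anaconda3
# create the environment defined in snakemake_cluster.yml
conda env create -f snakemake_cluster.yml
# make conda activate available in this shell and check the installation
eval "$(conda shell.bash hook)"
conda activate snakemake_cluster
snakemake --version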

Snakefile

For illustrative purposes, the Snakefile contains two rules. The first runs a few bash commands and then sleeps for a random number of seconds between 1 and 100. The second counts the number of characters in the output file that the first rule generates.

rule all:
   input:
      expand("data/small_job_{iteration}.txt", iteration=[1,2,3])

rule big_job:
   output:
      "data/big_job_{iteration}.txt"
   shell:
      """
      date '+%Y-%m-%d %H:%M:%S' > {output}
      hostname >> {output}
      echo "Host has $(ps -e | wc -l) processes running" >> {output}
      delay=$((1 + $RANDOM % 100))
      echo "Will sleep for $delay seconds" >> {output}
      sleep $delay
      date '+%Y-%m-%d %H:%M:%S' >> {output}
      """

rule small_job:
   input:
      "data/big_job_{iteration}.txt"
   output:
      "data/small_job_{iteration}.txt"
   shell:
      """
      date '+%Y-%m-%d %H:%M:%S' > {output}
      hostname >> {output}
      wc -c {input} >> {output}
      date '+%Y-%m-%d %H:%M:%S' >> {output}
      """

Cluster-specific configuration

To run the pipeline on the cluster, we create a script called run.slurm with the following content.

#!/usr/bin/env bash

jobN=100
if [ "$1" ]; then
   jobN=$1
fi
shift

eval "$(conda shell.bash hook)"
conda activate snakemake_cluster

snakemake --cluster-config cluster.yml "$@" \
   -j $jobN \
   --cluster "sbatch "`
      `"-p {cluster.partition} "`
      `"--ntasks 1 "`
      `"--cpus-per-task {cluster.threads} "`
      `"--mem {cluster.mem} "`
      `"--time {cluster.time} "`
      `"-o {cluster.output} "`
      `"-e {cluster.error} "

The first line indicates that the script should be processed with bash when run.

The next section checks whether any arguments have been passed to the script. The first argument, if supplied, sets the maximum number of slurm jobs (running or pending) that Snakemake keeps in the queue at any one time; otherwise, the default value of 100 is used.
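
In other words, the first (and optional) argument of run.slurm caps the number of jobs that Snakemake keeps in the slurm queue. For example:

# use the default limit of 100 jobs
./run.slurm
# or allow at most 50 running/pending jobs
./run.slurm 50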

The number of jobs that a user can submit depends on the qos (Quality of Service), which in turn depends on the job's requested runtime. You can see the current limits by running sacctmgr show qos format=name,maxwall,maxsubmitjobs,maxjobspu. At the time of writing, the following limits are enforced (subject to change).

Name      MaxWall      MaxSubmit  MaxJobsPU
normal    1-00:00:00   10000
long      7-00:00:00   500
medium    2-00:00:00   5000
verylong  28-00:00:00  10         1
vesta     7-00:00:00   500        20
debug     12:00:00     5          1

MaxWall shows the maximum runtime; MaxSubmit is the maximum number of jobs you can have submitted at a time to a specific partition under the corresponding qos; MaxJobsPU is the maximum number of jobs that can run in parallel on a specific partition under the corresponding qos. An empty value in the MaxJobsPU column means that the value is the same as MaxSubmit.
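
Since these limits may change, it is worth re-running the command quoted above occasionally, for example on a login node:

sacctmgr show qos format=name,maxwall,maxsubmitjobs,maxjobspu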

Next, we have two lines that initialise conda and activate the environment that we created for this project.

Finally, there is the snakemake call. The first parameter specifies the cluster configuration file that we will discuss later. It is followed by "$@", which expands to the remaining arguments passed to the script (the job number, if given, has already been removed by shift). Please note that if you want to pass additional parameters to the snakemake call, you should always specify the job number first, for example ./run.slurm 100 --keepgoing (see below for details). The -j parameter limits the number of jobs Snakemake submits at a time, and the --cluster parameter describes the command used to submit the jobs.

Backticks are used for string continuation, i.e. the value is just a single string equivalent to "sbatch -p {cluster.partition} --ntasks 1 --cpus-per-task {cluster.threads} ...". The variables in curly brackets will be taken from the cluster configuration file, cluster.yml. In our case, the configuration file is

__default__:
   name: "{rule}.{wildcards}"
   output: log/jobs/{rule}.{wildcards}.out
   error: log/jobs/{rule}.{wildcards}.err
   threads: 1
   mem: "1024M"
   time: "00:05:00"
   account: xxx.uzh
   partition: generic
big_job:
   threads: 4
   mem: "14800M"
   time: "1:00:00"
   partition: hpc

The default section specifies the default parameters for all jobs. It is highly recommended to provide default values for all variables used in run.slurm. After that, you can optionally have multiple sections with names that match your rule names. In those sections, you can override the default values. In our example, we only override the default values for the big_job rule. The small_job rule uses only the default values. All variables in the rule-specific sections are optional. If a variable is not defined in the rule section, the value from the __default__ section will be used.
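
For instance, if small_job ever needed more than the defaults, you could add a second rule section; the values below are purely illustrative:

small_job:
   time: "00:10:00"
   mem: "2048M"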

In the run.slurm file, the variable names have the form {cluster.var_name}. Thus, cluster. is a prefix and var_name should match a variable name in cluster.yml. We have the following variables. (Note that {rule} will be replaced with the rule name, while {wildcards} will be replaced with all wildcards used in the rule.)

  • name: name of the job.
  • output: location of the output file relative to the working directory of Snakemake.
  • error: location of the error file. It could be the same as output.
  • threads: number of vCPUs to request for the rule.
  • mem: total amount of memory requested for the rule.
  • time: expected maximum runtime for an individual invocation of a rule.
  • account: project (tenant) name. To list your accounts, run sacctmgr show assoc format=account%30,partition user=your_username
  • partition: partition where the rule should be submitted.

Before running the pipeline, we need to create the log/jobs directory where slurm will write the output.

mkdir -p log/jobs

Now, we can run the whole pipeline with ./run.slurm. You can check the status by running squeue -u your_username. This pipeline should finish fairly quickly, in a couple of minutes. However, a typical pipeline may take days to complete. To be able to disconnect from the server without terminating Snakemake, you need to run it with nohup.

nohup ./run.slurm &> log/snakemake.log &

This runs the script in the background and redirects both standard output and standard error to log/snakemake.log. Since the process is running in the background, you can now disconnect from the server without waiting for the pipeline to complete.
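
To follow the progress afterwards, you can watch the Snakemake log and the slurm queue, for example:

# follow Snakemake's own log
tail -f log/snakemake.log
# list the jobs that Snakemake has submitted to slurm
squeue -u your_username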

Note

Here, we run the pipeline on a login node because, at the moment, only a single job can run under the verylong qos. We plan to create a separate qos for Snakemake pipelines that would allow running multiple jobs with minimal resource usage for a very long time. At that point, we will also introduce stricter time limits for processes running on the login nodes, and it will then be necessary to run Snakemake pipelines on compute nodes.

In some cases, you may want to specify additional Snakemake parameters without adding them to run.slurm. This can be done by passing such parameters as arguments to the run.slurm script. In that case, the job number becomes required. For example, you can pass --keepgoing as

nohup ./run.slurm 100 --keepgoing &> log/snakemake.log &

The run.slurm script will strip the job number value and pass --keepgoing to snakemake.
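
The same pattern works with several additional parameters, as long as the job number comes first. For example, assuming you also want Snakemake's --rerun-incomplete behaviour:

nohup ./run.slurm 100 --keepgoing --rerun-incomplete &> log/snakemake.log &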


Last update: July 30, 2021