
Job submission

Select partition

Before you submit a job, you need to decide which partition would work best for you. Once you have chosen a partition, you can enable it by loading the corresponding module. For example, the following command loads the correct environment for the generic partition.

module load generic

Alternatively, you can specify the partition with the --partition parameter of sbatch, as described later. However, if you need to load any software modules, you must still load a partition module first.

CPU jobs

Jobs are typically submitted as bash scripts. At the top of such a script, you can specify various SBATCH parameters for your job submission, such as the amount of memory and the number of CPUs that you want to request. After that, you include the commands you want to execute. The script below simply writes the name of the node where the job will run to a file named job.out.

#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=100
#SBATCH --time=2:00
#SBATCH --output=job.out

srun hostname

The first line is a so-called shebang that specifies the interpreter for the file. In this case, the interpreter is bash. However, it could be any other interpreter such as tcsh, python, or R.

The job above requests 1 CPU (--cpus-per-task=1) and 100 MB of RAM (--mem=100) for 2 minutes (--time=2:00). These and other parameters are described below in greater detail.

It is essential to place all SBATCH directives immediately after the shebang. Otherwise, they will be ignored. If you do not specify any parameters, Slurm will allocate 1 vCPU, 1 MB of memory, and 1 second of execution time. Since such an allocation is insufficient for any real job, you should specify at a minimum the amount of memory and the execution time.

If you save the script to a file named myjob, you can schedule its execution with

sbatch myjob

Upon successful submission, Slurm will print the Job ID.
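
For example, the confirmation message looks similar to the following, where the number is the Job ID assigned to your job (the actual value will differ).

Submitted batch job 1234567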

If you have a compiled application, you can replace the hostname command with a call to your application. In addition, you can run several commands sequentially within the same job script. Consider two hypothetical applications: convert_data and process_data. Each accepts certain parameters, and both reside in your ~/data/bin directory. You can run process_data after convert_data with the following script.

#!/usr/bin/env bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=7700
#SBATCH --time=30:00
#SBATCH --output=job.out

srun ~/data/bin/convert_data -i input.csv -o input.txt
srun ~/data/bin/process_data -i input.txt --threads=2 -o results.txt

If your application is available as a module, you need to load the module either in the job script before you make the call or on the command line before you submit the job. Below is an example with Mathematica where the module is loaded in the job script.

#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850
#SBATCH --time=30:00
#SBATCH --output=job.out

module load mathematica
srun wolframscript -file myscript.wls

To run a script written in an interpreted language, it is often necessary to configure the environment first. We provide examples for common languages, including R and Python.
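
As a rough illustration, a Python job script could look like the sketch below. The module name anaconda3 and the script name myscript.py are assumptions for this example; consult the language-specific examples for the exact setup on your partition.

#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850
#SBATCH --time=30:00
#SBATCH --output=job.out

# The module name below is an assumption; check `module avail` for the actual name.
module load anaconda3
srun python myscript.py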

GPU jobs

To schedule a GPU job on one of the GPU partitions, you would typically need to load the cuda module. This module becomes available only after you load one of the GPU partition modules, but you can specify all necessary modules in a single command.

module load volta cuda/10.2

In most cases, the cudnn module will be required as well. Since a specific cudnn version is compatible only with certain cuda versions, we also provide modules that load compatible combinations.

module load volta nvidia/cuda10.2-cudnn7.6.5

GPU jobs should explicitly request GPU resources by passing --gres gpu:1. The number at the end of the parameter value indicates the number of requested GPU devices. Typically, you would need only a single device.

GPU jobs normally need a relatively small amount of system memory. In most cases, it would be sufficient to request 4000 MB or even less; e.g., --mem=4000.

The sample script below requests a single GPU device and 4000 MB of system memory for 1 hour.

#!/usr/bin/env bash
#SBATCH --gres gpu:1
#SBATCH --mem=4000
#SBATCH --time=01:00:00
#SBATCH --output=job.out

nvidia-smi

Parameters

Slurm parameters can be specified either at the top of the job submission script with the #SBATCH prefix or on the command line. Parameters indicated on the command line override those in the job script. For example, the script below requests a 2-hour execution time and 2000 MB of RAM.

#!/usr/bin/env bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=2000
#SBATCH --time=02:00:00

srun my_computations -t 2

However, you can schedule it to run for up to 4 hours with 4000 MB of RAM using the following command.

sbatch --time=04:00:00 --mem=4000 myscript

CPUs

The --cpus-per-task flag controls the number of CPUs that will be made available for each task of the job. By default, your job will have a single task. Jobs with multiple tasks are described in the Parallelisation section.

Warning

If you try to use more threads than the number of CPUs you have requested, those threads will not all run simultaneously but will compete with each other for CPU time. This can significantly degrade the performance of your job and will be slower than running the matching number of threads. This is true even when each thread uses less than 100% of the CPU time.
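
One way to keep the thread count aligned with your allocation is to read Slurm's SLURM_CPUS_PER_TASK environment variable inside the job script instead of hard-coding a number. The sketch below assumes a hypothetical application my_app that accepts a --threads option.

#!/usr/bin/env bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=3850
#SBATCH --time=30:00
#SBATCH --output=job.out

# SLURM_CPUS_PER_TASK matches the --cpus-per-task request above.
srun ./my_app --threads="$SLURM_CPUS_PER_TASK"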

Memory

There are two ways to request memory. You can specify the total amount with the --mem flag as in the examples above. Alternatively, you can use the --mem-per-cpu flag to request a certain amount for each requested CPU. The value is in MB, but GB can be specified with the G suffix; e.g., --mem-per-cpu=4G. If your job allocates more memory than requested, Slurm may terminate it.

The memory per vCPU ratio is the same on all nodes of a particular partition. It is 4 GB/vCPU on generic, 8 GB/vCPU on hpc, and 24 GB/vCPU on hydra. However, Slurm has to reserve some memory for system operations. Therefore, optimal hardware utilisation can only be achieved when a slightly smaller amount of memory is requested per vCPU, namely 3850 MB for generic, 8000 MB for hpc, and 24100 MB for hydra. If you request multiples of those numbers with --mem or the exact numbers with --mem-per-cpu, you might be able to schedule more jobs to run in parallel than with 4G, 8G, or 24G on generic, hpc, and hydra respectively.
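
For example, on the generic partition a 4-CPU job can request memory per CPU as follows (the job script name myjob is a placeholder); this is equivalent to requesting --mem=15400 in total.

sbatch --cpus-per-task=4 --mem-per-cpu=3850 myjob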

Time

You should strive to split your calculations into jobs that can finish in fewer than 24 hours. Short jobs are easier to schedule; i.e., they are likely to start earlier than long jobs. If something goes wrong, you might be able to detect it earlier. In case of a failure, you will be able to restart calculations from the last checkpoint rather than from the beginning. Finally, long jobs fill up the queue for extended periods and prevent other users from running their smaller jobs.

A job's runtime is controlled by the --time parameter. The value is formatted as dd-hh:mm:ss, where dd is the number of days, hh the hours, mm the minutes, and ss the seconds. If the leading values are 0, they can be omitted. Thus, --time=2:00 means 2 minutes, --time=36:00:00 stands for 36 hours, and --time=1-12:30:00 requests 1 day, 12 hours, and 30 minutes.

If your job runs beyond the specified time limit, Slurm will terminate it. Depending on the value of the --time parameter, Slurm automatically places jobs into one of the Quality of Service (QOS) groups, which in turn affects job scheduling priority as well as some other limits and properties. ScienceCluster has four different QOS groups.

  • normal: 24 hours
  • medium: 48 hours
  • long (vesta for GPU partitions): 7 days
  • verylong: 28 days

To be able to use the verylong QOS (i.e., running times over 7 days), please request access via the S3IT issue tracker. A single user can run only one job with the verylong QOS at a time. If you schedule multiple verylong jobs, they will run serially regardless of resource availability.
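
To see which QOS your queued or running jobs were placed in, you can, for example, list them with squeue using its standard format fields.

squeue -u $USER --Format=jobid,name,qos,timelimit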

You can view the details of each QOS using the sacctmgr show qos command. For example,

sacctmgr show qos format=name,priority,maxwall,maxsubmit

This command will show the name of the QOS, the priority, the maximum wall time, and the maximum number of jobs per user that can be submitted to the queue.

Output and errors

A job's output can be saved to a custom file using the --output parameter. For example, --output=job.out indicates that all output that the script would have printed on the screen or console should be directed to the job.out file in your current working directory.

If you want the output file to be in a different location, you can use either an absolute path (e.g., --output=/scratch/username/logs/job.out) or a path relative to your working directory (e.g., --output=logs/job.out). In addition, you can use the %j placeholder for the Job ID; e.g., --output=logs/job_%j.out. If you do not specify the --output parameter, the output will be directed to slurm-%j.out in your working directory.

Warning

The directory where you plan to write the output file should already exist. If it does not, the job will fail.

By default, Slurm writes error messages to the same file. However, you can redirect error messages to a different file by adding an extra parameter --error=job.err.
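
For instance, the sketch below writes regular output and error messages to separate files in a logs subdirectory, using the %j placeholder so that each job gets its own files. Remember to create the directory (e.g., with mkdir logs) before submitting.

#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=100
#SBATCH --time=2:00
#SBATCH --output=logs/job_%j.out
#SBATCH --error=logs/job_%j.err

srun hostname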

GPUs

GPUs are available only on the volta and vesta partitions. They can be requested in the list of generic consumable resources via the --gres parameter. Specifically, a single GPU device can be requested as --gres gpu:1. It is possible to request multiple GPUs as well, but please ensure that your job can actually consume them. Requesting more than 1 GPU without changing the way your applications run will not make them run faster and may increase the time your job waits in the queue.

The volta partition has nodes with different amounts of GPU memory. If your job fails because it runs out of GPU memory, you can specifically request the higher-memory nodes by specifying the node type; i.e., --gres gpu:Tesla-V100-32GB:1. As with the number of GPU devices, you only need to do so when your application runs out of GPU memory on the nodes with 16 GB of on-board memory. Your job will not run faster on a high-memory node.
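
For example, assuming the GPU script above is saved as myjob, you can request a high-memory node from the command line without editing the script, since command-line parameters override the #SBATCH directives.

sbatch --gres gpu:Tesla-V100-32GB:1 myjob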

It is important to understand the difference between GPU and system memory. Each GPU device has its own on-board memory that is available only to the code that runs on that device. Code that runs directly on a GPU device does not consume system memory. However, other portions of the application may use system CPUs and require system memory. In such cases, it may be necessary to request a higher amount of system memory with --mem.

Partition

The partition is automatically selected when you load the corresponding partition module. However, you can override it by setting the --partition parameter. This might be useful when you have a pipeline with heterogeneous jobs that would be more appropriate to run on different partitions.
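
For example, assuming your job script is saved as myjob, the following command submits it to the vesta partition regardless of which partition module is currently loaded.

sbatch --partition=vesta myjob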

Project

If you have access to the cluster under two different projects (i.e., tenants), you can choose which project should be billed for the job by setting the --account parameter. You can find more details in the Account Info section.
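
For example, the following command bills the job to a specific project; the account name myproject is a placeholder for one of your actual project names.

sbatch --account=myproject myjob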


Last update: March 3, 2022