Job submission
CPU jobs
Jobs are typically submitted as bash scripts. At the top of such a script, you can specify various SBATCH parameters for your job submission, such as the amount of memory and the number of CPUs that you want to request. After that, you include the commands you want to execute. The script below simply writes the name of the node where the job will run to a file named job.out.
#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=100
#SBATCH --time=2:00
#SBATCH --output=job.out
hostname
The first line is a so-called shebang that specifies the interpreter for the file. In this case, the interpreter is bash. However, it could be any other interpreter such as tcsh, python, or R.
The job above requests 1 CPU (--cpus-per-task=1) and 100 MB of RAM (--mem=100) for 2 minutes (--time=2:00). These and other parameters are described below in greater detail.
It is essential to place all SBATCH directives immediately after the shebang. Otherwise, they will be ignored. If you do not specify any parameters, Slurm will allocate 1 vCPU, 1 MB of memory, and 1 second of execution time. Since such an allocation is insufficient for any real job, at a minimum you should specify the amount of memory and the execution time.
If you save the script to a file named myjob, you can schedule its execution with
sbatch myjob
Upon successful submission, Slurm will print the Job ID.
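For example, the confirmation message looks like the following; the Job ID shown here is just an illustration.
$ sbatch myjob
Submitted batch job 123456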
If you have a compiled application, you can replace the hostname command with your application call. In addition, you can run several commands sequentially within the same job script. Consider two hypothetical applications: convert_data and process_data. Each accepts certain parameters, and both reside in your ~/data/bin directory. You can run process_data after convert_data with the following script.
#!/usr/bin/env bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=7700
#SBATCH --time=30:00
#SBATCH --output=job.out
~/data/bin/convert_data -i input.csv -o input.txt
~/data/bin/process_data -i input.txt --threads=2 -o results.txt
It is also possible to run several commands in parallel as described in the Advanced topics below.
If your application is available as a module, you need to load the module either in the job script before you make the call or on the command line before you submit the job. Below is an example with Matlab where the module is loaded in the job script.
Warning
You should not load flavour modules (e.g. intel, infiniband, gpu, etc.) in the job script. These modules set Slurm constraints that may differ from the job constraints. This may lead to problems with resource allocations for individual steps, particularly when you use srun for step execution.
#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850
#SBATCH --time=30:00
#SBATCH --output=job.out
module load matlab
matlab -nodisplay -nosplash -nodesktop -r "run('process_data.m');exit;"
To run a script written in an interpretable language, it is often necessary to configure the environment first. We provide examples for common languages, including R and Python.
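As an illustration, a minimal Python job could look like the sketch below. The module name anaconda3, the virtual environment path ~/myenv, and the script ~/my_script.py are assumptions to adapt to your own setup.
#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850
#SBATCH --time=10:00
#SBATCH --output=job.out

# Load a Python distribution module if your environment requires one (name is an assumption)
module load anaconda3

# Activate a virtual environment created beforehand (path is an assumption)
source ~/myenv/bin/activate

python ~/my_script.py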
GPU jobs
To schedule a GPU job, you would typically need to load a GPU module (gpu, multigpu) or a module for a specific GPU type (t4, v100, v100-32g, a100). The cuda and related modules become available only after you load one of these modules, but you can specify all the necessary modules in a single command.
Warning
You should not load GPU flavour modules (e.g. gpu, t4, a100, etc.) in the job script. These modules set Slurm constraints that may differ from the job constraints. This may lead to problems with resource allocations for individual steps, particularly when you use srun for step execution.
module load multigpu cuda/11.8.0
In most cases, the cudnn module will be required as well. Since a specific cudnn version is only compatible with certain cuda versions, we also provide modules that load compatible combinations.
module load multigpu cudnn/8.7.0
GPU jobs should explicitly request GPU resources by passing --gpus=1. The parameter value indicates the number of requested GPU devices.
GPU jobs normally need a relatively small amount of system memory. In most cases, it is sufficient to request 4000 MB per GPU or even less, e.g. --mem=4000.
The sample script below requests a single GPU device and 4000 MB of system memory for 1 hour.
#!/usr/bin/env bash
#SBATCH --gpus=1
#SBATCH --mem=4000
#SBATCH --time=01:00:00
#SBATCH --output=job.out
nvidia-smi
Low priority queue
A special queue lowprio is available for GPU jobs that have a limited duration and a lower priority. When the cluster is highly utilized, your GPU job might start earlier with this option. Submit your job with the --partition lowprio flag and a maximum time of 24 hours.
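For example, a submission to the low priority queue could look like the following command, where myjob stands for your own script.
sbatch --partition lowprio --time=24:00:00 myjob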
Parameters
Slurm parameters can be specified either at the top of the job submission script with the #SBATCH prefix or on the command line. Parameters indicated on the command line override those in the job script. For example, the script below requests a 2 hour execution time and 2000 MB of RAM.
#!/usr/bin/env bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=2000
#SBATCH --time=02:00:00
my_computations -t 2
However, you can schedule it to run for up to 4 hours with 4000 MB of RAM using the following command.
sbatch --time=04:00:00 --mem=4000 myscript
CPUs
The --cpus-per-task flag controls the number of CPUs that will be made available for each task of the job. By default, your job has a single task. Jobs with multiple tasks are described in the Parallelisation section.
Warning
If you try to use more threads than the number of CPUs you have requested, those threads will not all run simultaneously but will compete with each other for CPU time. This can significantly degrade the performance of your job, making it slower than running a number of threads that matches the allocation. This is true even when each thread uses less than 100% of the CPU time.
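If your application honours the standard OpenMP environment variable, one way to keep the thread count in line with the allocation is to derive it from the SLURM_CPUS_PER_TASK variable that Slurm sets for the job. The sketch below assumes a hypothetical threaded application my_threaded_app.
#!/usr/bin/env bash
#SBATCH --cpus-per-task=4
#SBATCH --mem=4000
#SBATCH --time=30:00
#SBATCH --output=job.out

# Use as many OpenMP threads as CPUs allocated to the task (defaults to 1 if unset)
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

./my_threaded_app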
Memory
There are two ways to request memory. You can specify the total amount with the --mem flag, as in the examples above. Alternatively, you can use the --mem-per-cpu flag to request a certain amount for each requested CPU. The value is in MB, but GB can be specified with the G suffix, e.g. --mem-per-cpu=4G. If your job allocates more memory than requested, Slurm may terminate it.
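As an illustration, the two header sketches below request the same total amount of memory: with 4 CPUs, --mem-per-cpu=2G amounts to 4 x 2 GB = 8 GB, which is equivalent to --mem=8G.
# Request the total amount for the whole job
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G

# ...or request the same total expressed per CPU (use one style, not both)
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G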
Time
You should strive to split your calculations into jobs that can finish in fewer than 24 hours. Short jobs are easier to schedule, i.e. they are likely to start earlier than long jobs. If something goes wrong, you might be able to detect it earlier. In case of a failure, you will be able to restart calculations from the last checkpoint rather than from the beginning. Finally, long jobs fill up the queue for extended periods and prevent other users from running their smaller jobs.
A job's runtime is controlled by the --time parameter. The value is formatted as dd-hh:mm:ss, where dd is the number of days, hh the hours, mm the minutes, and ss the seconds. If the leading values are 0, they can be omitted. Thus, --time=2:00 means 2 minutes, --time=36:00:00 stands for 36 hours, and --time=1-12:30:00 requests 1 day, 12 hours, and 30 minutes.
If your job runs beyond the specified time limit, Slurm will terminate it. Depending on the value of the --time parameter, Slurm automatically places jobs into one of the Quality of Service (QOS) groups, which affects several limits and properties such as the allowed number of running and pending jobs per user.
ScienceCluster has four different QOS groups.
- normal: 24 hours
- medium: 48 hours
- long: 7 days
- verylong: 28 days
In order to use the verylong QOS (i.e. running times over 7 days), please contact Science IT. A single user can run only one job with the verylong QOS at a time. If you schedule multiple verylong jobs, they will run serially regardless of resource availability.
You can view the details of each QOS using the sacctmgr show qos command. For example,
sacctmgr show qos format=name,priority,maxwall,maxsubmit,maxtrespu%30,maxjobspu
This command will show the name of the QOS, the priority, the maximum wall time, the maximum number of jobs per user that can be submitted to the queue, the maximum resources you can request, and the maximum number of jobs you can run.
Output and errors
A job's output can be saved to a custom file using the --output parameter. For example, --output=job.out indicates that all output that the script would have printed to the screen or console should be directed to the job.out file in your current working directory.
If you want the output file to be in a different location, you can use either an absolute path (e.g. --output=/scratch/$USER/logs/job.out) or a path relative to your working directory (e.g. --output=logs/job.out). In addition, you can specify a placeholder for the Job ID denoted with %j, e.g. --output=logs/job_%j.out. If you do not specify the --output parameter, the output will be directed to slurm-%j.out in your working directory.
Warning
The directory where you plan to write the output file should already exist. If it does not, the job will fail.
By default, Slurm writes error messages to the same file. However, you can redirect error messages to a different file by adding the extra parameter --error=job.err.
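Putting these options together, a script header could look like the sketch below. It assumes a logs directory that you have created beforehand (e.g. with mkdir -p logs), since Slurm will not create it for you.
#!/usr/bin/env bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=3850
#SBATCH --time=1:00:00
#SBATCH --output=logs/job_%j.out
#SBATCH --error=logs/job_%j.err
hostname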
GPUs
GPUs can be requested in the list of generic consumable resources via the --gpus parameter. Specifically, a single GPU device can be requested as --gpus=1. It is possible to request multiple GPUs as well, but please ensure that your job can actually consume them. Requesting more than 1 GPU without changing the way your applications run will not make them run faster and may increase the time your job waits in the queue.
Some nodes have GPUs with a larger amount of on-board GPU memory. If your job fails because it runs out of GPU memory, you can specifically request the higher-memory nodes by specifying the node type and a memory constraint, i.e. --gpus=V100:1 --constraint=GPUMEM32GB. As with the number of GPU devices, you only need to do so when your application runs out of GPU memory on the nodes with 16 GB of on-board memory. Your job will not run faster on a high-memory node. For convenience, we have added a module for V100s with 32 GB of GPU RAM; you can simply call module load v100-32g before submitting your script.
If you need at least 32 GB of GPU RAM and you do not have a preference between the 32 GB V100 and 80 GB A100 GPUs, you can use the following two flags when submitting your job or interactive session: --gpus=1 --constraint="GPUMEM32GB|GPUMEM80GB". You will then receive whichever GPU is available first (and cost contributions will apply according to the GPU you receive).
It is important to understand the difference between GPU memory and system memory. Each GPU device has its own on-board memory that is available only to the code that runs on that device. Code that runs directly on a GPU device does not consume system memory. However, other portions of the application may use system CPUs and require system memory. In such cases, it may be necessary to request a higher amount of system memory with --mem.
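As an illustration, the header below combines these options: it requests one GPU with at least 32 GB of on-board memory together with 8000 MB of system memory; treat the exact values as placeholders to adapt to your application.
#!/usr/bin/env bash
#SBATCH --gpus=1
#SBATCH --constraint="GPUMEM32GB|GPUMEM80GB"
#SBATCH --mem=8000
#SBATCH --time=02:00:00
#SBATCH --output=job.out
nvidia-smi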
Project
If you have access to the cluster under two different projects (i.e., tenants), you can choose which project should be billed for the job by setting the --account parameter. You can find more details in the Account Info section.
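For example, with a hypothetical project name myproject, the account can be set either in the job script or at submission time.
# In the job script (myproject is a hypothetical project name)
#SBATCH --account=myproject

# ...or on the command line when submitting
sbatch --account=myproject myjob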
Advanced topics
A Slurm job can look like a generic shell script where multiple commands run one after another in a serial fashion. However, it is also possible to send each command to the background and wait for them all to complete. This effectively runs them in parallel, but they will compete for the same allocated resources, and Slurm will report statistics (MaxRSS, AveCPU, etc.) only for the job as a whole.
This can be illustrated with a job that runs a simple Python script that consumes a specified amount of memory and then sleeps for a specified number of seconds. The implementation of the script is not important. The batch script looks as follows.
$ cat parallel_bg
#!/usr/bin/env bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=5:00
#SBATCH --output=log/parallel_bg
./eat_and_sleep.py 1 70 &
./eat_and_sleep.py 2 90 &
wait
The script requests 2 CPUs (note that --ntasks is 1 by default and a task encompasses all the commands in the script) and 4 GB of memory in total. It runs the script twice: the first time requesting 1 GB of memory and a sleep time of 70 seconds, and the second time requesting 2 GB of memory and 90 seconds of sleep. Both calls run in the background. The wait call at the end ensures that the batch script waits until both subprocesses terminate. Without it, the script would exit immediately without waiting for the subprocesses to complete.
For this job, Slurm reports the following information.
$ sacct -o jobid,jobname,reqcpus,reqmem,maxrss,avecpu,start,elapsed -j 343
JobID JobName ReqCPUS ReqMem MaxRSS AveCPU Start Elapsed
------------ ----------- -------- ---------- ---------- ---------- ------------------- ----------
343 parallel_bg 2 4G 2023-11-15T11:02:54 00:01:52
343.batch batch 2 3160732K 00:00:07 2023-11-15T11:02:54 00:01:52
343.extern extern 2 96K 00:00:00 2023-11-15T11:02:54 00:01:52
As you can see, the usage is aggregated for the batch. (The extern line shows the information for Slurm processes that are servicing the job and can, for the most part, be ignored.) It is also important to point out that the job resources are shared. So, if each script requested 3 GB, one of them would run out of memory and fail.
To address these issues, it is possible to run the scripts as separate job steps. This is achieved by scheduling the script calls with srun as shown below.
$ cat parallel_steps
#!/usr/bin/env bash
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=5:00
#SBATCH --output=log/parallel_steps
srun -n 1 -c 1 --mem=1G ./eat_and_sleep.py 1 70 &
srun -n 1 -c 1 --mem=2G ./eat_and_sleep.py 2 90 &
wait
Each invocation is a single task, as indicated by -n 1. The steps receive 1 CPU each and different amounts of memory from the total resources allocated for the batch. A step cannot request more resources than are allocated for the batch. If the job steps together request more resources than are allocated for the batch, Slurm will issue a warning and delay the execution of some steps to ensure that the allocation is not exceeded.
For the second batch, Slurm reports the following information.
$ sacct -o jobid,jobname,reqcpus,reqmem,maxrss,avecpu,start,elapsed -j 344
JobID JobName ReqCPUS ReqMem MaxRSS AveCPU Start Elapsed
------------ ------------ -------- ---------- ---------- ---------- ------------------- ----------
344 parallel_st+ 2 4G 2023-11-15T11:07:19 00:01:33
344.batch batch 2 8008K 00:00:00 2023-11-15T11:07:19 00:01:33
344.extern extern 2 72K 00:00:00 2023-11-15T11:07:19 00:01:33
344.0 eat_and_sle+ 1 1055956K 00:00:01 2023-11-15T11:07:19 00:01:12
344.1 eat_and_sle+ 1 2104528K 00:00:02 2023-11-15T11:07:19 00:01:33
Now, it is possible to see that the second step used more memory, more CPU time, and finished later than the first step.
Overall, it looks very similar to a job array. The main difference is that each job in a job array goes through the queue to get resources allocated. The resources for a batch job with steps are allocated only once. After that, the allocated resources are distributed among the steps. Thus, it is typically more beneficial to use job arrays for longer tasks because the scheduling overhead and initial preparation would be negligible compared to the runtime. On the other hand, it may be more efficient to run tasks as job steps when they do require considerable preprocessing but can be completed relatively quickly.
It is possible to submit job steps interactively after allocating resources with salloc. It has essentially the same parameters as sbatch. However, you do not submit a batch script but rather launch an interactive shell on the current node, from which you can launch steps with srun. The steps from the example above can be launched via salloc as follows.
$ salloc --cpus-per-task=2 --mem=9G --time=10:00
$ srun -n 1 -c 1 --mem=3G ./eat_and_sleep.py 1 70 &
$ srun -n 1 -c 1 --mem=6G ./eat_and_sleep.py 2 90 &
Note
Without any redirection, the output from the steps will be printed to the screen. Unlike sbatch, salloc does not support the --output parameter.
Since salloc opens a shell on the same node from which you submit the job (typically a login node), the debugging possibilities are somewhat limited. In particular, top and ps would show processes that run on the login node rather than on the node where the steps are running. Therefore, a more convenient approach might be to schedule an interactive session with srun and then launch screen or tmux on the compute node, or to start your task in the background and monitor it in the foreground.
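A minimal sketch of such an interactive session, assuming you again want 2 CPUs and 4 GB of memory for one hour, could look like this; the --pty flag attaches a pseudo-terminal so that the shell on the compute node behaves interactively.
srun --cpus-per-task=2 --mem=4G --time=1:00:00 --pty bash -l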