Job management

Job information

You can use the squeue utility to list the jobs in the queue. Without any parameters, it shows all jobs that are running or pending on the whole cluster. If you have previously loaded a partition module, the output is limited to the jobs scheduled on that partition.

squeue

You can limit the output to your own jobs with -u <username>, or to jobs from a specific partition with -p <partition>.

squeue -u asmith -p generic

The output includes basic job information such as job ID, user, and requested resources. A job cannot run past its END_TIME, but it can terminate earlier. The NODELIST(REASON) column shows the list of nodes for running jobs or the reason why a pending job has not started. The most common reason is (Priority), which means that the job is waiting because it has lower priority than other queued jobs. The pending jobs with the highest priority have (Resources) as the reason: they are next in line and are waiting for resources to become free. Occasionally, you may see (ReqNodeNotAvail). In most cases, it means that a reservation has been placed on partition nodes for upcoming maintenance, and your job cannot start because its runtime may overlap with the maintenance window.
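To see at a glance why your pending jobs are waiting, you can tally the reasons. The sketch below is a bash helper, not part of Slurm: the function name tally_reasons is made up for illustration, and it assumes squeue prints one reason per line when called with -h (no header) and -o "%r" (reason only).

```shell
#!/usr/bin/env bash
# Tally the pending reasons read from standard input (one reason per line)
# and print them as "Reason: count", most frequent first.
tally_reasons() {
    sort | uniq -c | sort -rn | sed -E 's/^ *([0-9]+) (.*)$/\2: \1/'
}

# Query the scheduler only where squeue is actually installed.
# -t PENDING keeps pending jobs, -h drops the header, %r prints the reason.
if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -t PENDING -h -o "%r" | tally_reasons
fi
```

If most of your jobs report Priority, they are simply waiting their turn; a large number of ReqNodeNotAvail entries usually points to a reservation such as a maintenance window.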

Information about jobs that ran previously can be obtained with the sacct utility. The most common parameters are listed below.

  • -S <date> displays jobs that started after the specified date. The date should be in ISO format, e.g. '2020-07-27'. You can also specify a time, e.g. '2020-07-27 14:30'.
  • -s <state> limits the output to jobs in specific states, e.g. -s FAILED,TIMEOUT would show jobs that failed or timed out.
  • -j <jobid> shows the information for the specified job only.

For example,

sacct -S '2020-07-27' -s COMPLETED
sacct -j 2905691

Jobs that successfully finished should have COMPLETED state and 0:0 exit code.
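In a script, the same check can be automated. The following bash sketch uses two hypothetical helper names, is_clean_finish and job_succeeded. It assumes the job is not a job array, so that sacct prints exactly one record when restricted to the allocation with -X.

```shell
#!/usr/bin/env bash
# Succeed only if the given "STATE|EXITCODE" record describes a clean finish.
is_clean_finish() {
    [ "$1" = "COMPLETED|0:0" ]
}

# Look up a job by ID and test its record.
# -X reports the allocation only (no job steps), -n drops the header,
# -P prints the fields separated by "|" without padding.
job_succeeded() {
    is_clean_finish "$(sacct -j "$1" -X -n -P -o state,exitcode)"
}
```

For example, job_succeeded 2905691 && echo "job finished cleanly" prints the message only when the job has the COMPLETED state and the 0:0 exit code.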

Among the default output columns, you may find MaxRSS particularly useful. It shows the maximum amount of RAM your job consumed at some point during its execution. This information can be used to adjust the amount of requested RAM for similar jobs in the future.
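MaxRSS values are easier to compare with a memory request once they are in a single unit. The helper below is a sketch with a made-up name, to_mib. It assumes that values carry a K, M, G, or T suffix in binary units (for example 2048K) and treats a value without a suffix as bytes. It rounds up, so the result is a safe lower bound for a future request.

```shell
#!/usr/bin/env bash
# Convert a size such as 2048K, 512M, or 2G to whole MiB, rounded up.
to_mib() {
    awk -v v="$1" 'BEGIN {
        unit = substr(v, length(v), 1)   # last character is the unit suffix
        num  = v + 0                     # awk reads the leading numeric part
        if      (unit == "K") mib = num / 1024
        else if (unit == "M") mib = num
        else if (unit == "G") mib = num * 1024
        else if (unit == "T") mib = num * 1024 * 1024
        else                  mib = num / (1024 * 1024)   # assumed to be bytes
        rounded = int(mib)
        if (rounded < mib) rounded++     # round up to the next whole MiB
        print rounded
    }'
}
```

For instance, to_mib 2048K prints 2. Note that sacct typically reports MaxRSS per job step, so the allocation line itself may have an empty value; take the largest step value when sizing a request.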

There are many other fields that you can request. You can see the whole list by running sacct --helpformat. The output format can be controlled with the -o parameter, which accepts a comma-separated list of fields.

sacct -S '2020-07-27' -s COMPLETED -o jobid,start,reqgres,reqmem,maxrss

In some cases, a column may not be wide enough to fit entire values. sacct appends a plus sign to the end of truncated values. You can increase column width by adding %x to the column names specified with -o. Here, x is the width of the corresponding column in characters. For example, the following command expands the width of JobID and ReqMem columns to 9 and 15 characters respectively.

sacct -S '2020-07-27' -s COMPLETED -o jobid%9,start,reqgres,reqmem%15,maxrss
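When the output is meant for a script rather than for reading, an alternative to widening columns is the -P (--parsable2) option, which prints fields separated by | with no fixed width. The sketch below extracts one column by its header name; pick_column is a hypothetical helper, and it assumes the first line of input is the header row.

```shell
#!/usr/bin/env bash
# Print one named column from pipe-separated input whose first line is a header.
pick_column() {
    awk -F'|' -v name="$1" '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
        col     { print $col }
    '
}

# Run against the accounting database only where sacct is installed.
if command -v sacct >/dev/null 2>&1; then
    sacct -S '2020-07-27' -s COMPLETED -P -o jobid,reqmem,maxrss | pick_column ReqMem
fi
```

If the requested name does not match any header, the helper prints nothing.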

Cancelling jobs

You can remove a pending or running job from the queue with scancel. Typically, you would use it with specific job ids.

scancel 2905690 2905691

However, it is possible to delete all your jobs that satisfy certain criteria. For example, you can delete all jobs that are scheduled on the generic partition.

scancel -p generic

You can delete all pending jobs that are scheduled on the hpc partition.

scancel -p hpc --state=PENDING

The command also has an interactive mode, in which it asks you to confirm the cancellation of each job before removing it. The mode is enabled with the -i flag.

scancel -i --state=RUNNING
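Because a filter can match more jobs than you expect, it may help to list the matching jobs first and confirm once for the whole set. The following bash sketch does that; cancel_with_preview is a made-up helper name, and it assumes squeue and scancel are given the same user, partition, and state filters so that both select the same jobs.

```shell
#!/usr/bin/env bash
# List the jobs matching a partition and state, then cancel them
# only after a single explicit confirmation.
cancel_with_preview() {
    local partition="$1" state="$2"
    local user="${USER:-$(id -un)}"

    # Preview: show exactly the jobs the filters select.
    squeue -u "$user" -p "$partition" -t "$state"

    printf 'Cancel the jobs listed above? [y/N] '
    local answer
    read -r answer
    if [ "$answer" = "y" ]; then
        scancel -u "$user" -p "$partition" --state="$state"
    else
        echo "Nothing cancelled."
    fi
}
```

For example, cancel_with_preview hpc PENDING shows your pending jobs on the hpc partition and cancels them only if you answer y.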

Last update: July 30, 2021