
R Example script with a job array

This tutorial demonstrates the basics of how to create an R environment on the ScienceCluster with specific packages of interest. It also demonstrates the use of a job array.

Preparing the environment

To begin, log in to the cluster and load the partition that you'd like to use. In this demonstration, given that there are no special requirements for GPUs, a large amount of memory, or a large number of CPUs, you will load the Generic partition. To do so, run the following from the ScienceCluster command line

module load generic

After loading the Generic partition module, you can run module av to see the software available on the cluster. As you'll notice, r is one of the available modules (with multiple versions). You can either run your code using the supplied version(s) of R, or you can build a custom R environment using a Conda environment or a container.

To load the default version of R on the cluster, run

module load r

Once you've loaded the R module, also consider whether you need other modules available on the cluster (e.g., GCC, which is provided as a module and can be loaded via a module load gcc command). Additionally, ensure that the packages you intend to use are available in your R environment. To check, once the R module has been loaded, you can start an interactive R session on the login node using the R command.

Warning

⚠️ To protect the integrity of the cluster for yourself and other users, only use interactive R sessions for exploring available packages and installing new packages. I.e., do NOT run your analytical code this way.

Once the interactive R session has started, you can see the packages that are installed by default on the system with the following command

installed.packages()
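Since installed.packages() returns a matrix with one row per installed package, a quick way to check for a single package of interest (using tictoc as an example) is

# Returns TRUE if the package is already installed, FALSE otherwise
"tictoc" %in% rownames(installed.packages())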

If the packages you need are listed in the output, you can move forward to setting up your submission script. If you need more packages, you can install them from the interactive session with install.packages(). For instance, if you want to use the tictoc package, from the interactive R session you would run

install.packages('tictoc')
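After the installation completes, you can verify that the package loads correctly in the same session. A minimal sanity check using tictoc's own timer functions might look like

# Load the newly installed package and time a short operation
library(tictoc)
tic("sanity check")  # start a named timer
Sys.sleep(1)         # stands in for real work
toc()                # prints the elapsed time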

Note

⚠️ Make sure to quit the interactive R session once you've successfully installed your packages of interest. To quit R, simply run q() and follow the prompts. You do not need to save your workspace image, so you can enter n when asked.

When you install packages for the first time, you will be prompted about whether you'd like to install them within a user library. You can answer yes to these prompts if the default user location suggested by R suits your needs. If you have a large number of packages to install and your home directory may not have enough space for them, consider instead passing another location in your data area via the lib argument of the install.packages() function. For example, to install the tictoc package into a directory titled 'rpackages' within your user data area, you would first create the directory from the login node command line with

mkdir -p /data/$USER/rpackages

Then you can specify it as the location for your R packages using the lib argument from within an interactive R session on the login node, like so

username = Sys.getenv()['USER']
install.packages('tictoc', lib=paste("/data/",username,"/rpackages",sep=""))

Note

⚠️ Notice that this code snippet (and the one below) uses the Sys.getenv() function to retrieve the $USER variable from the cluster environment. The paste() function is then used to construct the full path as a string. You can call print(paste("/data/",username,"/rpackages",sep="")) to see this string after username = Sys.getenv()['USER'] has been called.

This new directory can be added to your .libPaths() in R, or you can simply specify it when loading the package. For example, to load the package you just installed into this new location, use the following lines at the beginning of your R script. (Make sure to load your required packages within the R script that you submit.)

username = Sys.getenv()['USER']
library("tictoc", lib.loc=paste("/data/",username,"/rpackages",sep=""))

Preparing the job submission script

Once you've installed the packages you'll use in your analysis, you're ready to prepare the job submission script. This particular job submission script will use a job array. Job arrays are useful when you need to run the same analytical code across numerous datasets or parameter sets. For example,

#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4000
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-3
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
srun Rscript --vanilla testarray.R $SLURM_ARRAY_TASK_ID
Save this code as a Bash shell script named arrayscript.sh in your home/cluster/<your_username> location.

There are a few aspects of this submission script to note:

  • First, the --output and --error flags specify file names in a format that identifies them in terms of both the overall Job ID and the sub-job ID. In other words, when you submit a job array you receive a single Job ID for the overall submission, and every individual job within the array receives a sub-job ID. The format, as specified in these lines, will be similar to arrayJob_123456_1, where 123456 is an example Job ID and 1 is an example sub-job ID. Note: to achieve this format, %A represents the overall Job ID in the file name and %a represents the sub-job ID.

  • Second, the --array flag here specifies the Bash array that is used for this job submission. Specifically, an array of 1-3 will expand so that there are 3 sub-jobs using 3 values (i.e., 1, 2, and 3). The array value for each sub-job can be used both in the job output and error file names and within the submitted code script (see below). Other array specifications are possible; for example, --array=1,2,5,19,27 would specify the values 1, 2, 5, 19, and 27. Alternatively, --array=1-7:2 would use values between 1 and 7 with a step size of 2 (i.e., 1, 3, 5, and 7). Creative uses of the --array input and existing variables in the script allow you to use this functionality with great flexibility. For example, a simple set of array values (like --array=1-3), used as an index into a list object in the analysis code script, can retrieve any data value of interest (see the sketch after this list).

  • Lastly, the line that reads echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID will print each sub-job's task ID in its output file to allow for greater readability of the output files. Moreover, adding $SLURM_ARRAY_TASK_ID to the end of the Rscript line passes the sub-job ID (in this case, a value of 1, 2, or 3) into the R script itself as a command line argument (see below).
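To illustrate the indexing idea from the second point above, here is a minimal sketch in which the task ID selects from a hypothetical params vector (only the --array values are taken from this tutorial)

# Read the array task ID passed in by the submission script
args = commandArgs(TRUE)
taskId = as.integer(args[1])  # 1, 2, or 3 when using --array=1-3

# Hypothetical parameter values, one per sub-job
params = c(0.1, 0.5, 1.0)
paramValue = params[taskId]   # sub-job 1 gets 0.1, sub-job 2 gets 0.5, ...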

Preparing the code to be run

The analysis code used in this example is quite simple, but it uses the $SLURM_ARRAY_TASK_ID value to demonstrate how such variables can be passed into R code. The code is as follows

# Enable the use of command line argument inputs to the R script
args = commandArgs(TRUE)
input1 = args[1]

# Use the command line argument input in some way: here, write a
# 1x4 CSV file whose entries are all the array task ID
fileName = paste(input1, ".csv", sep="")
integerValue = as.integer(input1)
write.csv(matrix(rep(integerValue, 4), nrow=1), file=fileName, row.names=FALSE)

and should be saved in your home/cluster/<your_username> location with the file name testarray.R.

As noted in the comments, the first section of the code reads the value of $SLURM_ARRAY_TASK_ID, which the submission script passes as a command line argument, via the commandArgs() function. Once the array value has been imported in this way, it can be used within the R script as a normal data variable. In this code, the array value is simply cast to an integer that is then written to a CSV file. The expected output of this job submission is thus: three output log files, three error log files, and three CSV files written to the home area of the cluster (matching the array values 1-3). Each file should correspond to the array sub-job ID.
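For instance, the sub-job that receives task ID 2 should produce a file named 2.csv with contents similar to the following (write.csv generates the V1-V4 column names automatically when a matrix has no column names of its own)

"V1","V2","V3","V4"
2,2,2,2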

Submitting the job

To submit the script, ensure that the submission script and the R script are both located in the home/cluster/<your_username> location. Once these scripts are prepared, the modules have been loaded, and the R packages have been installed, simply run sbatch arrayscript.sh. When submitted, the console should print a message similar to

Submitted batch job <jobid>

where <jobid> is the numeric Job ID assigned by the SLURM batch submission system.
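While the array is running, you can check its status from the login node with the standard SLURM command

squeue -u $USER

Each sub-job appears under a <jobid>_<taskid> identifier that matches the naming of the output and error files.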

Understanding job outputs

To reiterate, this example array job will produce a set of outputs corresponding to the array values 1-3. For every sub-job submitted from the array, you should receive a .out output file (containing the printed output of the sub-job), a .err error file (logging any errors from the sub-job), and a .csv file that uses the array sub-job ID as its name.


Last update: March 21, 2022