
R Example script with a job array

This tutorial demonstrates the basics of how to create an R environment on the ScienceCluster with specific packages of interest. It also demonstrates the use of a job array.

Preparing the environment

To begin, log in to the cluster and load the partition that you'd like to use. In this demonstration, given that there are no special requirements for GPUs, a large amount of memory, or a large number of CPUs, you will load the Generic partition. To do so, run the following from the ScienceCluster command line

module load generic

After loading the Generic partition module, you can run module av to see the software available on the cluster. As you'll notice, r is one of the available modules (with multiple versions). You can either run your code using the supplied version(s) of R, or you can build a custom R environment using a Conda environment or a container.

To load the default version of R on the cluster, run

module load r

Once you've loaded the R module, also consider whether you need other modules available on the cluster (e.g., GCC, which is provided as a module and can be loaded via a module load gcc command). Additionally, ensure that the packages you intend to use are available in your R environment. To check, once the R module has been loaded, you can start an interactive R session on the login node using the R command.

Warning

⚠️ To protect the integrity of the cluster for yourself and other users, only use interactive R sessions for exploring available packages and installing new packages. I.e., do NOT run your analytical code this way.

Once the interactive R session has started, you can see the packages that are installed by default on the system with the following command

installed.packages()
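Since installed.packages() returns a matrix with one row per installed package, a quick way to check for a single package of interest (using tictoc as an example) is

# Returns TRUE if the package is already installed, FALSE otherwise
"tictoc" %in% rownames(installed.packages())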

If the packages you need are listed in the output, you can move forward to setting up your submission script. If you need more packages, you can install them from the interactive session with install.packages(). For instance, if you want to use the tictoc package, from the interactive R session you would run

install.packages('tictoc')
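After the installation completes, you can verify that the package loads correctly in the same session. A minimal sanity check using tictoc's own timer functions might look like

# Load the newly installed package and time a short operation
library(tictoc)
tic("sanity check")  # start a named timer
Sys.sleep(1)         # stands in for real work
toc()                # prints the elapsed time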

Note

⚠️ Make sure to quit the interactive R session once you've successfully installed your packages of interest. To quit R, simply run q() and follow the prompts. You do not need to save your workspace image, so you can enter n when asked.

When you install packages for the first time, you will be prompted about whether you'd like to install them within a user library. You can answer yes to these prompts if the default user location suggested by R suits your needs. If you have a large number of packages to install and your home directory may not have enough space for them, consider instead passing another location in your data area via the lib argument of the install.packages() function. For example, to install the tictoc package into a directory titled 'rpackages' within your user data area, you would first create the directory from the login node command line with

mkdir -p /data/$USER/rpackages

Then you can specify it as the location for your R packages using the lib argument from within an interactive R session on the login node, like so

username = Sys.getenv()['USER']
install.packages('tictoc', lib=paste("/data/",username,"/rpackages",sep=""))

Note

⚠️ Notice that this code snippet (and the one below) uses the Sys.getenv() function to retrieve the $USER variable from the cluster environment. The paste() function is then used to construct the full path as a string. You can call print(paste("/data/",username,"/rpackages",sep="")) to see this string after username = Sys.getenv()['USER'] has been called.

This new directory can be added to your .libPaths() in R, or you can simply specify it when loading the package. For example, to load the package you just installed into this new location, use the following lines at the beginning of your R script. (Make sure to load your required packages within the R script that you submit.)

username = Sys.getenv()['USER']
library("tictoc", lib.loc=paste("/data/",username,"/rpackages",sep=""))

Preparing the job submission script

Once you've installed the packages you'll use in your analysis, you're ready to prepare the job submission script. This particular job submission script will use a job array. Job arrays are useful when you need to run the same analytical code across numerous datasets or parameter sets. For example,

#!/bin/bash
#SBATCH --job-name=arrayJob
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4000
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-3
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
srun Rscript --vanilla testarray.R $SLURM_ARRAY_TASK_ID
Save this code as a Bash shell script named arrayscript.sh in your home/cluster/<your_username> location.

There are a few aspects of this submission script to note:

  • First, the --output and --error flags specify file names in a format that identifies them in terms of both the overall Job ID and the sub-job ID. In other words, when you submit a job array you receive a single Job ID for the overall submission, and every individual job within the array receives a sub-job ID. The format, as specified in these lines, will be similar to arrayJob_123456_1, where 123456 is an example Job ID and 1 is an example sub-job ID. Note: to achieve this format, %A represents the overall Job ID in the file name and %a represents the sub-job ID.

  • Second, the --array flag here specifies the Bash array that is used for this job submission. Specifically, an array of 1-3 will expand so that there are 3 sub-jobs using 3 values (i.e., 1, 2, and 3). The array value for each sub-job can be used both in the job output and error file names and within the submitted code script (see below). Other array specifications are possible; for example, --array=1,2,5,19,27 would specify the values 1, 2, 5, 19, and 27. Alternatively, --array=1-7:2 would use values between 1 and 7 with a step size of 2 (i.e., 1, 3, 5, and 7). Creative uses of the --array input and existing variables in the script allow you to use this functionality with great flexibility. For example, a simple set of array values (like --array=1-3), used as an index into a list object in the analysis code script, can retrieve any data value of interest (see the sketch after this list).

  • Lastly, the line that reads echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID will print each sub-job's task ID in its output file to allow for greater readability of the output files. Moreover, adding $SLURM_ARRAY_TASK_ID to the end of the Rscript line passes the sub-job ID (in this case, a value of 1, 2, or 3) into the R script itself as a command line argument (see below).
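To illustrate the indexing idea from the second point above, here is a minimal sketch in which the task ID selects from a hypothetical params vector (only the --array values are taken from this tutorial)

# Read the array task ID passed in by the submission script
args = commandArgs(TRUE)
taskId = as.integer(args[1])  # 1, 2, or 3 when using --array=1-3

# Hypothetical parameter values, one per sub-job
params = c(0.1, 0.5, 1.0)
paramValue = params[taskId]   # sub-job 1 gets 0.1, sub-job 2 gets 0.5, ...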

Preparing the code to be run

The analysis code used in this example is quite simple, but it uses the $SLURM_ARRAY_TASK_ID value to demonstrate how such variables can be passed into R code. The code is as follows

# Enable the use of command line argument inputs to the R script
args = commandArgs(TRUE)
input1 = args[1]

# Use the command line argument input in some way: here, write a
# 1x4 CSV file whose entries are all the array task ID
fileName = paste(input1, ".csv", sep="")
integerValue = as.integer(input1)
write.csv(matrix(rep(integerValue, 4), nrow=1), file=fileName, row.names=FALSE)

and should be saved in your home/cluster/<your_username> location with the file name testarray.R.

As noted in the comments, the first section of the code reads the value of $SLURM_ARRAY_TASK_ID, which the submission script passes as a command line argument, via the commandArgs() function. Once the array value has been imported in this way, it can be used within the R script as a normal data variable. In this code, the array value is simply cast to an integer that is then written to a CSV file. The expected output of this job submission is thus: three output log files, three error log files, and three CSV files written to the home area of the cluster (matching the array values 1-3). Each file should correspond to the array sub-job ID.
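For instance, the sub-job that receives task ID 2 should produce a file named 2.csv with contents similar to the following (write.csv generates the V1-V4 column names automatically when a matrix has no column names of its own)

"V1","V2","V3","V4"
2,2,2,2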

Submitting the job

To submit the script, ensure that the submission script and the R script are both located in the home/cluster/<your_username> location. Once these scripts are prepared, the modules have been loaded, and the R packages have been installed, simply run sbatch arrayscript.sh. When submitted, the console should print a message similar to

Submitted batch job <jobid>

where <jobid> is the numeric Job ID assigned by the SLURM batch submission system.
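While the array is running, you can check its status from the login node with the standard SLURM command

squeue -u $USER

Each sub-job appears under a <jobid>_<taskid> identifier that matches the naming of the output and error files.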

Understanding job outputs

To reiterate, this example array job will produce a set of outputs corresponding to the array values 1-3. For every sub-job submitted from the array, you should receive a .out output file (containing the printed output of the sub-job), a .err error file (logging any errors from the sub-job), and a .csv file that uses the array sub-job ID as its name.


Last update: March 21, 2022