
Create an R Environment and Run a Job Array

This tutorial demonstrates how to create an R environment on ScienceCluster and how to run a job array within the R environment. It is intended for basic R workflows without GPUs or large memory requirements.

Workflow Overview

R environments follow the same general Conda environment workflow used elsewhere on ScienceCluster, and we recommend the same principles for integrating an R environment into job submission scripts:

  1. Set up your R environment: first start an interactive session. Then, create your R environment using mamba / conda and install the necessary packages within that session.

  2. Integrate environment into SLURM script: after installing the packages, exit the interactive session. On the login nodes, integrate the R environment into your SLURM script and submit the batch script from the login node.

Prepare the R environment

R environments on ScienceCluster are created using mamba or conda. Because environment creation and package installation can be resource intensive, perform the setup in an interactive session on a compute node rather than on the login nodes.

To create an R environment:

srun --pty -n 1 -c 4 --time=01:00:00 --mem=8G bash -l
module load mamba
mamba create -n renv -c conda-forge r-base -y
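
Many common R packages are also published on conda-forge as r-* packages and can be installed with mamba directly, which avoids compiling them from source inside R later. A minimal example, assuming the package you need (here ggplot2) is available on conda-forge:

mamba install -n renv -c conda-forge r-ggplot2 -y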

For more details, refer to the generic Conda-based R environment guide.

To activate the environment:

source activate renv
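
To confirm that the environment's R is the one on your path, you can then run:

which R
R --version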

Once the R environment is created, consider whether you need additional cluster modules, e.g., GCC, which can be loaded with module load gcc.

Once the R environment is activated, start an interactive R session by running in the terminal:

R

From the interactive R session, you can check which packages are installed and install any additional packages you need, as in the sketch below.
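
A minimal sketch of both steps from the R prompt (the package name data.table is just an illustrative example):

# List the packages currently visible to R
rownames(installed.packages())
# Install an additional package from CRAN
install.packages("data.table")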

Once you have verified and installed all packages required for your workflow, you can proceed to setting up your job submission script.

Install packages to the /data directory

When installing R packages for the first time, you will be prompted to choose whether to install them in a user library. You can safely answer yes if the default user library location suggested by R (typically in your home directory) meets your needs.

If you plan to install a large number of packages or anticipate that your home directory may not have sufficient space, you can instead specify an alternative location in your /data space using the lib argument of the install.packages() function.

For example, to install the ggplot2 package into a directory called rpackages in your /data area, first create the directory:

mkdir -p /data/$USER/rpackages

Then, from an interactive R session, specify this directory as the installation location using the lib argument:

username <- Sys.getenv("USER")
install.packages("ggplot2", lib = paste("/data/", username, "/rpackages", sep = ""))

This snippet uses Sys.getenv() to retrieve your ScienceCluster $USER variable. The paste() function constructs the full path to your custom package directory.
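
Equivalently, R's file.path() builds the same path with less punctuation:

install.packages("ggplot2", lib = file.path("/data", Sys.getenv("USER"), "rpackages"))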

You can either specify the directory with lib.loc explicitly when loading a package, or add this custom directory to .libPaths() in R.

  • Use lib.loc when loading a package. For example, to load the ggplot2 package installed in your custom rpackages directory, include the following lines at the beginning of your R script (remember to load all required packages within the script you submit):
username <- Sys.getenv("USER")
library("ggplot2", lib.loc = paste("/data/", username, "/rpackages", sep = ""))
  • Update .libPaths. To make R automatically search your custom package directory without specifying lib.loc each time, you can append the directory to .libPaths() at the start of your R script (a short verification sketch follows this list):
username <- Sys.getenv("USER")
custom_lib <- paste("/data/", username, "/rpackages", sep = "")
.libPaths(c(.libPaths(), custom_lib))
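
Once the directory is on the search path, library() finds packages installed there without extra arguments. A short verification sketch, assuming ggplot2 was installed to the custom directory as above:

.libPaths()        # the custom directory should appear in this list
library("ggplot2") # now resolves from the custom directory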

Prepare the job submission script

Once you've installed the packages you'll use in your analysis, you're ready to prepare the job submission script. This particular submission script uses a job array. Job arrays are useful when you need to run the same analytical code across numerous datasets or parameter sets. The following command creates a submission script called arrayscript.sh.

cat << EOF > arrayscript.sh
#!/usr/bin/bash -l
#SBATCH --job-name=arrayJob
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-3

module load mamba
source activate renv
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " \$SLURM_ARRAY_TASK_ID
Rscript --vanilla testarray.R \$SLURM_ARRAY_TASK_ID
EOF

To view the contents of the file, run cat arrayscript.sh.

There are a few aspects of this submission script to note:

  • First, the --output and --error flags specify file names in a format that identifies them by both the overall Job ID and the sub-job ID. When you submit a job array, you receive a single Job ID for the overall submission, and every individual job within the array receives a sub-job ID. The file names produced by these lines will look like arrayJob_123456_1, where 123456 is an example Job ID and 1 is an example sub-job ID. To achieve this format, %A represents the overall Job ID in the file name pattern and %a represents the sub-job ID.

  • Second, the --array flag specifies the array of task IDs used for this job submission. An array of 1-3 expands to 3 sub-jobs using 3 values (i.e., 1, 2, and 3). Each sub-job's array value can be used in the job output and error file names as well as within the submitted code script (see below). Other array specifications are possible; for example, --array=1,2,5,19,27 would specify the values 1, 2, 5, 19, and 27, while --array=1-7:2 would use the values between 1 and 7 with a step size of 2 (i.e., 1, 3, 5, and 7). Creative uses of the --array input and existing variables in the script allow great flexibility; for example, a simple set of array values (like --array=1-3) can serve as an index into a list object in the analysis code to retrieve any data value of interest (see the sketch after this list).

  • Lastly, the line that reads echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID prints each sub-job's task ID in its output file for greater readability. Moreover, appending $SLURM_ARRAY_TASK_ID to the end of the Rscript line passes the sub-job's task ID (in this case, a value of 1, 2, or 3) to the R script as a command-line argument (see below).
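
As an illustration of that indexing pattern, the sketch below (with hypothetical file names, not part of this tutorial's scripts) uses the task ID to select one dataset per sub-job:

args <- commandArgs(TRUE)
taskID <- as.integer(args[1])
# Hypothetical input files: sub-job 1 processes dataset_a.csv, and so on
datasets <- c("dataset_a.csv", "dataset_b.csv", "dataset_c.csv")
inputFile <- datasets[taskID]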

Prepare the R code to be run

The analysis code used in this example is quite simple, but it uses the $SLURM_ARRAY_TASK_ID environment variable to demonstrate how such values can be passed into R code. To create the file testarray.R, run

cat << EOF > testarray.R
# Enable the use of command-line argument inputs to the R script
args <- commandArgs(TRUE)
input1 <- args[1]
# Use the command-line argument input in some way
fileName <- paste0(input1, ".csv")
integerValue <- as.integer(input1)
# Write a 1x4 matrix of the task ID to a CSV named after the task ID
write.csv(matrix(rep(integerValue, 4), nrow = 1), file = fileName, row.names = FALSE)
EOF

To view the contents of the file, run cat testarray.R.

As noted in the comments, the first section of the code reads the value of $SLURM_ARRAY_TASK_ID, which the submission script passes as a command-line argument, via the commandArgs() function. This is equivalent to running, for example, Rscript --vanilla testarray.R 1. Once the array value has been passed to R, it can be used like any other variable. In this code, the array value is simply cast to an integer that is then written to a CSV. The expected output of this job submission is thus three output log files, three error log files, and three CSVs (matching the array values 1-3), written to the directory from which the job was submitted. Each file corresponds to its array sub-job's task ID.
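
For instance, the CSV produced by the sub-job with task ID 1 (1.csv) should contain:

"V1","V2","V3","V4"
1,1,1,1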

Submitting the job

To submit the script, ensure that both the submission script and the R script are in the same folder. Once these scripts are prepared, the modules have been loaded, and the R packages have been installed, simply run sbatch arrayscript.sh. When submitted, the console should print a message similar to

Submitted batch job <jobid>

where <jobid> is the Job ID numeric code assigned by the Slurm batch submission system.
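
While the array runs, standard Slurm commands can be used to monitor it, for example:

squeue -u $USER    # list your pending and running sub-jobs
sacct -j <jobid>   # show the state of each task in the array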

Understanding job outputs

To reiterate, this example array job produces a set of outputs corresponding to the array values 1-3. For every sub-job in the array, you should receive a .out output file (containing the printed output of that sub-job), a .err error file (logging any errors from that sub-job), and a .csv file named after the array task ID.
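
Assuming a Job ID of 123456, as in the earlier example, a directory listing after all sub-jobs finish would look like:

1.csv  2.csv  3.csv
arrayJob_123456_1.err  arrayJob_123456_2.err  arrayJob_123456_3.err
arrayJob_123456_1.out  arrayJob_123456_2.out  arrayJob_123456_3.out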