Create an R Environment and Run a Job Array¶
This tutorial demonstrates how to create an R environment on ScienceCluster and how to run a job array within the R environment. It is intended for basic R workflows without GPUs or large memory requirements.
Workflow Overview¶
We recommend following the general Conda environment workflow on ScienceCluster when integrating an R environment into job submission scripts:
- Set up your R environment: first start an interactive session. Then create your R environment using mamba/conda and install the necessary packages within that session.
- Integrate the environment into your SLURM script: after installing the packages, exit the interactive session. Back on the login node, integrate the R environment into your SLURM script and submit the batch script from there.
Prepare the R environment¶
R environments on ScienceCluster are created using mamba or conda. Because environment creation and package installation can be resource intensive, it is recommended to perform environment setup in an interactive session on a compute node rather than on the login nodes.
To create an R environment:
# Request an interactive session on a compute node (1 task, 4 CPUs, 8 GB memory, 1 hour)
srun --pty -n 1 -c 4 --time=01:00:00 --mem=8G bash -l
# Load the mamba module and create an environment named renv with base R from conda-forge
module load mamba
mamba create -n renv -c conda-forge r-base -y
For more details, refer to the generic Conda-based R environment guide.
To activate the environment:
source activate renv
Once the R environment is created, consider whether you need additional cluster modules, e.g., GCC, which can be loaded with module load gcc.
Once the R environment is activated, start an interactive R session by running in the terminal:
R
Using an interactive R session, you can check which packages are installed or install any additional packages you need (see the sketch after this list). You can install packages:
- in your default user library, or
- in a custom directory.
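As a minimal sketch of these checks from an interactive R session (the data.table package below is only an illustration, not a requirement of this tutorial):
# List the packages R can currently find across its library paths
rownames(installed.packages())
# Check whether a specific package is already available
"ggplot2" %in% rownames(installed.packages())
# Install an additional package into the default user library
install.packages("data.table")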
Once you have verified and installed all packages required for your workflow, you can proceed to setting up your job submission script.
Install packages to the /data directory¶
When installing R packages for the first time, you will be prompted to choose whether to install them in a user library. You can safely answer yes if the default user library location suggested by R (typically in your home directory) meets your needs.
If you plan to install a large number of packages or anticipate that your home directory may not have sufficient space, you can instead specify an alternative location in your /data space using the lib argument of the install.packages() function.
For example, to install the ggplot2 package into a directory called rpackages in your /data area, first create the directory:
mkdir -p /data/$USER/rpackages
Then, from an interactive R session, specify this directory as the installation location using the lib argument:
username <- Sys.getenv("USER")
install.packages("ggplot2", lib = paste("/data/", username, "/rpackages", sep = ""))
This snippet uses Sys.getenv() to retrieve your ScienceCluster $USER variable, and the paste() function constructs the full path to your custom package directory.
You can either specify the directory explicitly with lib.loc when loading a package, or add this custom directory to .libPaths() in R.
- Use lib.loc when loading a package. For example, to load the ggplot2 package installed in your custom rpackages directory, include the following line at the beginning of your R script (remember to load all required packages within the script you submit):
username <- Sys.getenv("USER")
library("ggplot2", lib.loc = paste("/data/", username, "/rpackages", sep = ""))
- Update .libPaths(). To make R automatically search your custom package directory without specifying lib.loc each time, you can append the directory to .libPaths() at the start of your R script:
username <- Sys.getenv("USER")
custom_lib <- paste("/data/", username, "/rpackages", sep = "")
.libPaths(c(.libPaths(), custom_lib))
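As a quick check, assuming ggplot2 was installed into the custom directory as shown earlier, the package should now load without lib.loc:
.libPaths()         # the custom /data directory should appear in the output
library("ggplot2")  # found via the updated library search path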
Prepare the job submission script¶
Once you've installed the packages you'll use in your analysis, you're ready to prepare the job submission script. This particular job submission script will use a job array. Job arrays are useful if you need to run a number of jobs across numerous datasets or parameter sets using the same analytical code. The following command will create a submission script called arrayscript.sh.
cat << EOF > arrayscript.sh
#!/usr/bin/bash -l
#SBATCH --job-name=arrayJob
#SBATCH --time=01:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4GB
#SBATCH --output=arrayJob_%A_%a.out
#SBATCH --error=arrayJob_%A_%a.err
#SBATCH --array=1-3
module load mamba
source activate renv
# Print this sub-job's task ID
echo "My SLURM_ARRAY_TASK_ID: " \$SLURM_ARRAY_TASK_ID
Rscript --vanilla testarray.R \$SLURM_ARRAY_TASK_ID
EOF
To view the contents of the file, run cat arrayscript.sh.
There are a few aspects of this submission script to note:
- First, the --output and --error flags specify file names in a format that identifies them in terms of both the overall Job ID and the sub-job ID. In other words, when you submit a job array you receive a single Job ID for the overall submission, and every individual job within the array receives a sub-job ID. The format, as specified in these lines, will be similar to arrayJob_123456_1, where 123456 is an example Job ID and 1 is an example sub-job ID. Note: to achieve this format, %A represents the overall Job ID in the desired character string, and %a represents the sub-job ID.
- Second, the --array flag specifies the set of array task IDs used for this job submission. Specifically, an array of 1-3 will expand so that there are 3 sub-jobs using 3 values (i.e., 1, 2, and 3). The array value for each sub-job can be used both in the job output and error files and within the submitted code script (see below). Other array values can be used; for example, --array=1,2,5,19,27 would specify the values 1, 2, 5, 19, and 27. Alternatively, --array=1-7:2 would use values between 1 and 7 with a step size of 2 (i.e., 1, 3, 5, and 7). Creative uses of the --array input and existing variables in the script allow you to use this functionality with great flexibility. For example, a simple set of array values (like --array=1-3), used as an index into a list object in the analysis code, can retrieve any data value of interest (see the sketch after this list).
- Lastly, the line that reads echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID prints each sub-job's array task ID in its output file to allow for greater readability of the output files. Moreover, adding $SLURM_ARRAY_TASK_ID to the end of the Rscript line passes the sub-job ID (in this case, a value of 1, 2, or 3) as a command-line argument to the R script itself (see below).
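To illustrate the indexing idea from the second point, the sketch below selects an input file by array task ID. The datasets vector and its file names are hypothetical examples, not part of this tutorial:
# Use the array task ID passed on the command line to select one input from a list
args <- commandArgs(TRUE)
task_id <- as.integer(args[1])                              # 1, 2, or 3 when submitted with --array=1-3
datasets <- c("cohortA.csv", "cohortB.csv", "cohortC.csv")  # hypothetical input files
input_file <- datasets[task_id]                             # sub-job 1 processes cohortA.csv, and so on
cat("This sub-job will process:", input_file, "\n")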
Prepare the R code to be run¶
The analysis code used in this example is quite simple, but it uses the $SLURM_ARRAY_TASK_ID environment variable to demonstrate how environment variables can be passed into R code. To create the file testarray.R, run:
cat << EOF > testarray.R
# Enable the use of command line argument inputs to the R script
args = commandArgs(TRUE)
input1 = args[1]
# Use the command line argument input in some way
fileName = paste0(input1,".csv")
integerValue = as.integer(input1)
write.csv(matrix(c(integerValue,integerValue,integerValue,integerValue), nrow=1), file=fileName, row.names=FALSE)
EOF
To view the contents of the file, run cat testarray.R.
As noted in the comments, the first section of the code imports the value of the $SLURM_ARRAY_TASK_ID environment variable via the commandArgs() function. This is equivalent to running, for example: Rscript --vanilla testarray.R 1. Once the array value has been passed to R, it can be used within the R script as a normal data variable. In this code, the array value is simply cast to an integer that is then written to a CSV. The expected output of this job submission is thus: three output log files, three error log files, and three CSVs written to the directory from which the job was submitted (matching the array values 1-3). Each file should correspond to the array sub-job ID.
Submitting the job¶
To submit the script, ensure that both the submission script and the R script are in the same folder. Once these scripts are prepared, the modules have been loaded, and the R packages have been installed, simply run sbatch arrayscript.sh. When submitted, the console should print a message similar to Submitted batch job <jobid>, where <jobid> is the numeric Job ID assigned by the Slurm batch submission system.
Understanding job outputs¶
To reiterate, this example array job will produce a set of outputs corresponding to the array values 1-3. For every sub-job submitted from the array, you should receive a .out output file (which contains the printed output from that sub-job), a .err error file (which logs any errors from that sub-job), and a .csv file that uses the array sub-job ID as its name.
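As a quick sanity check from an interactive R session, assuming the job was submitted with --array=1-3 as above, each CSV should contain a single row with four copies of its sub-job's task ID:
read.csv("1.csv")   # expected: one row with the value 1 in each of four columns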