
Python Example Script with TensorFlow

This tutorial demonstrates the basics of how to create a Python environment on the ScienceCluster with specific packages of interest, in this case TensorFlow.

Preparing the environment

To begin, log in to the cluster and load the module for the partition that you'd like to work with. In this demonstration, given that TensorFlow is the package of interest, you will load one of the GPU-enabled partitions (e.g., Vesta). To do so, run the following from the ScienceCluster command line:

module load vesta

Since you'll be using GPUs in this example, you'll also need to load GPU processing software (CUDA and cuDNN) that works with TensorFlow. Because the version of TensorFlow must be compatible with the GPU software being used, it is recommended to specify the versions explicitly. You can view the available versions by running module av and then name the desired version when you load the software. For example,

module load nvidia/cuda11.2-cudnn8.1.0
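
If you would like to confirm that the partition and GPU software modules are both active before continuing, you can list your currently loaded modules:

module list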

Once the partition has been chosen and the GPU software has been loaded, you should then construct an encapsulated environment into which you can install all of the necessary software dependencies. One tool that you can use for this purpose is Anaconda. To load Anaconda, run

module load anaconda3
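
As a quick sanity check that Anaconda is now available, you can print the version of the conda command (the exact version reported will depend on the anaconda3 module):

conda --version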

Then, run the following set of commands line-by-line to construct a Conda environment (named "tensorflowexample") and install TensorFlow into it.

# This line creates the environment (with a specific version of Python)
# Note: if you do not specify a version of Python when you create the environment,
# the system's default Python will be used. If you install additional packages using Conda,
# a newer version of Python may be installed and then made default in the environment.
conda create --name tensorflowexample python=3.9
# This line activates the environment (notice the change in your Terminal)
source activate tensorflowexample
# This line installs TensorFlow into the environment (including other packages that are required)
conda install tensorflow-gpu
# This line deactivates the environment so additional commands can be issued
conda deactivate

Warning

The conda-forge channel does not currently have the tensorflow-gpu package, but it is often searched with higher priority than the defaults channel. When tensorflow is installed from conda-forge and tensorflow-gpu comes from defaults, TensorFlow will not be able to detect GPUs. If you have the conda-forge channel enabled in your .condarc, you have to ignore it during installation, i.e., install the package with conda install -c defaults tensorflow-gpu.
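
To double-check the installation, you can confirm that the environment exists and see which channel each TensorFlow package came from (the exact package list and versions will vary with your installation):

# confirm that the environment was created
conda env list
# list the installed TensorFlow packages and the channels they came from
conda list -n tensorflowexample tensorflow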

Preparing the job submission script

Once the Conda environment is created and prepared with the packages of interest, it can be activated from within the job submission script. The example job submission script for this demonstration is

#!/bin/bash
# Request 10 minutes of run time, 1 task with 1 CPU, 3000 MB of memory, and 1 GPU
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=3000
#SBATCH --gres gpu:1
# Activate the Conda environment and run the example script
source activate tensorflowexample
srun python examplecode.py

which is saved as the file tfsubmission.sh in your home directory on the cluster (i.e., home/cluster/<your_username>).

Note

⚠️ Please note that the --gres gpu:1 flag is included in this batch submission script. This flag requests the GPUs available in the Vesta (and Volta) cluster partitions. Slurm will reject any job submitted to the GPU partitions without this flag.
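
As an optional extra check, you could add a call to nvidia-smi to the job script just before the srun line; this is not part of the example above and assumes nvidia-smi is available on the GPU node, but it will print the GPU(s) allocated to your job into the Slurm output file:

# optional: show the GPU(s) visible to this job
nvidia-smi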

Preparing the code to be run

The Python code that is being called in the script is

import tensorflow as tf
from tensorflow.python.client import device_lib

# List the GPUs that TensorFlow can detect
print(tf.config.list_physical_devices('GPU'))
print()

# Print the name of the default GPU device (an empty string means no GPU was found)
print(tf.test.gpu_device_name())
print()

# Check whether this TensorFlow build was compiled with CUDA support
print(tf.test.is_built_with_cuda())
print()

# Print details of all local devices (CPUs and GPUs)
print(device_lib.list_local_devices())

which is also saved in home/cluster/<your_username> as examplecode.py.
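
Before submitting, you can verify that both files are in place in your home directory (this simply lists the two files and will report an error if either is missing):

ls ~/tfsubmission.sh ~/examplecode.py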

Submitting the job

To submit this script for processing (after the modules have been loaded and the Conda environment has been created), simply run sbatch tfsubmission.sh. When submitted, the console should print a message similar to

Submitted batch job <jobid>

where <jobid> is the numeric job ID assigned by the Slurm batch submission system.
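
While the job is waiting in the queue or running, you can monitor it with the standard Slurm commands, for example (replace <jobid> with the ID printed above):

# show the status of this particular job
squeue -j <jobid>
# or show all of your own queued and running jobs
squeue -u $USER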

Understanding job outputs

When the job runs to completion (provided your submitted code does not produce any errors), any files written by your script will be in their designated locations, and a file named slurm-<jobid>.out will exist in your cluster storage area. This file contains the printed output from your job.
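
For example, once the job has finished you can print the contents of the output file directly from the command line (again replacing <jobid> with your actual job ID):

cat slurm-<jobid>.out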


Last update: October 26, 2021