Data storage and transfer
There are three complementary filesystems where you can store your data.
|Filesystem|Quota|Storage|
|home|15 GB, 100k files|Redundant, SSD|
|data|200 GB|Redundant, SSD|
|scratch|20 TB|Redundant, HDD|
In addition, shared group storage is provided, which requires cost contributions but has no associated quota.
Each user has a home directory where configuration files, source code, and other small important files can be stored. The directory has a limit of 100,000 files and/or 15 GB of used space. The quota makes it impractical for large data storage or software installations.
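To see how close a directory is to these limits, you can count its files and measure its size. The following is a minimal sketch that runs on a throwaway directory standing in for your home directory. It counts regular files only, and it assumes the 15 GB limit is measured in binary units; the source does not state either detail, so treat the numbers as approximate.

```shell
#!/bin/sh
# Sketch: compare a directory's file count and size with the home quota.
# The temporary directory is a stand-in for "$HOME" (an assumption for the demo).
set -eu
dir=$(mktemp -d)
touch "$dir/a.txt" "$dir/b.txt"
mkdir "$dir/sub"
touch "$dir/sub/c.txt"

# Count regular files recursively (directories are not counted here).
file_count=$(find "$dir" -type f | wc -l | tr -d ' ')

# Total disk usage in KiB; du prints "<size><TAB><path>".
used_kib=$(du -sk "$dir" | cut -f1)

echo "files: $file_count of 100000"
echo "used:  $used_kib KiB of $((15 * 1024 * 1024)) KiB"

rm -rf "$dir"
```

To check your real home directory, replace the temporary directory with "$HOME" and drop the setup and cleanup lines.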
For persistent storage of larger files, you can use the data filesystem (
/data/<username>). It has a limit of 200 GB and, like the other filesystems, it is not backed up. This filesystem is also appropriate for software installations. For example, if you use Conda, you should consider changing the location of your
.conda directory from the
home directory to your data directory.
The scratch filesystem (
/scratch/<shortname>) is for the temporary storage of large input data files used during your calculations. Each user has a quota of 20 TB. The maximum file size is limited to 10 TB. Please note that this filesystem is meant for temporary storage only. According to the service agreement, any files older than one month are subject to deletion.
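To find files that may be approaching the deletion threshold, you can list files by modification time. The sketch below runs on a throwaway directory and treats "one month" as 30 days. Whether the policy is based on modification time or access time is not stated in the source, so the use of -mtime is an assumption.

```shell
#!/bin/sh
# Sketch: list files whose modification time is more than 30 days in the past.
# The temporary directory is a stand-in for your scratch directory.
set -eu
dir=$(mktemp -d)
touch "$dir/new.dat"
touch -t 202001010000 "$dir/old.dat"   # backdate the file to 1 January 2020

# -mtime +30 matches files modified more than 30 whole days ago.
old_files=$(find "$dir" -type f -mtime +30)
echo "$old_files"

rm -rf "$dir"
```

Pointing find at your own scratch directory instead of the temporary one gives the list of files to copy elsewhere before they become subject to deletion.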
Scalable group storage has no quota but requires a cost contribution based on the actual usage. The default permissions are set so that each member of the project has access to the shared folder, which can be found at this path:
/net/cephfs/shares/matrix.uzh. (In this case, replace
matrix.uzh with your actual project name.)
You can create a symlink called
shares in your home directory that points to this shared group folder:
ln -s /net/cephfs/shares/matrix.uzh ~/shares
You can transfer files with the
scp command. The first argument is the source file while the second argument indicates the target location. For example, you can copy a file from your computer to the
data directory on the cluster by running the following command on your computer.
scp my_local_file.txt firstname.lastname@example.org:data
To copy a file from the cluster, you specify the server and the remote path as the first argument and local path as the second. For example, you can copy
job_results.txt that resides in your
scratch directory on the cluster to your computer by running the following command on your computer.
scp email@example.com:/scratch/<username>/job_results.txt .
The . (i.e., "dot") character stands for the current directory. You can specify any other location with either an absolute path or a path relative to your current directory.
You can also transfer a whole directory using the -r flag.
scp -r my/local/dir firstname.lastname@example.org:scratch/target
However, for transfers that involve many files or directories, it is often more efficient to use
rsync. This program synchronises files between the source and destination. Thus, if your transfer fails or if only some of your files have been updated,
rsync is more efficient because it does not transfer data that is already identical in both locations. For example, the following command can be used in place of the previous scp command.
rsync -az --progress my/local/dir email@example.com:scratch/target
As with scp, the first location is the source file or directory while the second is the target location. The
-a flag invokes the archive mode that, roughly speaking, recreates the structure and permissions of the source directory on the target machine. The
-z flag instructs
rsync to compress the data before the transfer, which can make the transfer faster especially when your connection speed is low. As the name suggests, the
--progress option shows the transfer progress information.
Before running the synchronisation, you can run the command with
-n to preview which files will be transferred. It is necessary to specify
--progress in this case. Otherwise,
rsync will not display any output.
rsync -azn --progress my/local/dir firstname.lastname@example.org:scratch/target
You can exclude files and directories from synchronisation with
--exclude. This parameter can be specified multiple times. For example, the following command will ignore all files and directories named
cache as well as all files that have the .tmp extension.
rsync -azn --progress --exclude='cache' --exclude='*.tmp' my/local/dir email@example.com:scratch/target
By default, rsync does not remove files from the target directory even if they have been deleted from the source directory. The deletion of old files can be enabled with
--delete. It is strongly recommended to preview the changes with
-n before running
rsync with the
--delete flag. If you specify the wrong target directory, all files in that directory will be deleted without confirmation.
rsync -az --progress --delete firstname.lastname@example.org:scratch/source my/local/target
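The preview-then-delete workflow can be rehearsed safely on two throwaway local directories before you run it against real data. This is only a sketch: the directory and file names are invented for the demonstration, -v is used in place of --progress to list the affected files, and the script exits early if rsync is not installed.

```shell
#!/bin/sh
# Sketch: rehearse a --delete sync on temporary local directories.
# All names below are hypothetical and exist only for this demo.
set -eu
command -v rsync >/dev/null 2>&1 || { echo "rsync not found, skipping"; exit 0; }

work=$(mktemp -d)
mkdir "$work/source" "$work/target"
echo keep  > "$work/source/keep.txt"
echo keep  > "$work/target/keep.txt"
echo stale > "$work/target/stale.txt"   # exists only in the target

# Dry run (-n): reports what would be deleted without changing anything.
preview=$(rsync -avn --delete "$work/source/" "$work/target/")
echo "$preview"

# Real run: files missing from the source are removed from the target.
rsync -a --delete "$work/source/" "$work/target/"
remaining=$(ls "$work/target")
echo "$remaining"

rm -rf "$work"
```

The dry-run output names each file that would be removed, which is the point at which a mistyped target directory can still be caught.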
A trailing slash at the end of the source directory instructs
rsync to synchronise the contents of the source directory rather than the directory itself. Suppose, for example, that the source directory
scratch/data contains a single file,
test.txt. If you do not specify the trailing slash (i.e., scratch/data),
rsync will create a
data directory in your local target directory and transfer the contents there.
rsync -az email@example.com:scratch/data my/local/target
ls my/local/target
# data
ls my/local/target/data
# test.txt
If you add the trailing slash (i.e., scratch/data/),
rsync will place
test.txt directly into your target directory.
rsync -az firstname.lastname@example.org:scratch/data/ my/local/target
ls my/local/target
# test.txt
Data sharing among cluster users is managed through Active Directory (AD) groups. Each AD group corresponds to a matching account in the ScienceCluster.
To share data with other users in your group, you need the name of the ScienceCluster account to which your username belongs. This name is used to construct the group argument for the commands shown below.
For example, if user
asmith would like to share the
project1 directory with the
matrix.uzh group, the group ownership could be changed recursively.
$ chgrp -R S3IT_T_hpc_matrix.uzh /scratch/asmith/project1
$ ls -ld /scratch/asmith/project1
drwxrwx--- 1 asmith S3IT_T_hpc_matrix.uzh 1 May 26 12:26 /scratch/asmith/project1
$ ls -l /scratch/asmith/project1/
-rw-rw---- 1 asmith S3IT_T_hpc_matrix.uzh 0 May 26 12:26 data.txt.xz
Note: the group argument used in the command is
S3IT_T_hpc_matrix.uzh. A group titled
esheep.uzh would use
S3IT_T_hpc_esheep.uzh. In other words, add
S3IT_T_hpc_ before the group/account name to construct the correct argument for the command.
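The naming rule can be captured in a small shell function so that the prefix is not mistyped. The function name ad_group is hypothetical and is not part of any cluster tooling; it only prepends the prefix described above.

```shell
#!/bin/sh
# Hypothetical helper: build the AD group name from an account name
# by prepending the S3IT_T_hpc_ prefix.
ad_group() {
    printf 'S3IT_T_hpc_%s\n' "$1"
}

ad_group matrix.uzh
ad_group esheep.uzh
```

With this helper, the earlier command could be written as chgrp -R "$(ad_group matrix.uzh)" /scratch/asmith/project1.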
In this example, only the members of
matrix.uzh will be able to access
project1 and only if they know the exact path. Alternatively,
asmith can choose to share his whole scratch directory in a read-only manner with the matrix.uzh group.
$ chgrp -R S3IT_T_hpc_matrix.uzh /scratch/asmith
$ chmod -R g+rX /scratch/asmith
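The capital X in g+rX is what keeps the share read-only without making data files executable: it grants execute (traverse) permission on directories, but on regular files only if an execute bit is already set. The sketch below shows the effect on a throwaway directory. The starting modes 700 and 600 are assumptions chosen to make the change visible.

```shell
#!/bin/sh
# Sketch: effect of "chmod -R g+rX" on a directory and a regular file.
# The temporary directory and its contents are made up for this demo.
set -eu
dir=$(mktemp -d)
mkdir "$dir/project1"
touch "$dir/project1/data.txt"
chmod 700 "$dir/project1"            # owner-only directory
chmod 600 "$dir/project1/data.txt"   # owner-only file, not executable

chmod -R g+rX "$dir/project1"

# Keep only the ten mode characters from the ls output.
dir_mode=$(ls -ld "$dir/project1" | cut -c1-10)
file_mode=$(ls -l "$dir/project1/data.txt" | cut -c1-10)
echo "$dir_mode"    # drwxr-x---
echo "$file_mode"   # -rw-r-----

rm -rf "$dir"
```

The group gains read and traverse access to the directory and read access to the file, while write permission stays with the owner.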