Data Encryption
Overview
Some workflows involve sensitive data that must be encrypted at rest and decrypted on the fly while a job runs. For this, we describe here a solution based on gocryptfs, an encrypted overlay filesystem.
The gocryptfs setup requires two directories:
- The encrypted storage directory, hereafter data-encrypted.
- The plain, unencrypted view of the data (data-plain). It is important to notice that this directory is empty on the actual storage system; gocryptfs uses it to present an unencrypted view of the data in data-encrypted.
This approach assumes that the node the job is running on (i.e., the one that needs to decrypt the data) and the PALMA management servers are trustworthy.
A compromise of either would allow someone else to steal your data; see "Additional sbatch flags" below for a way to slightly mitigate this.
- The encrypted storage is protected by a password, and protection is only as strong as the password.
- If both the password and the master key are lost, there is no way to restore the data.
Setting up your encrypted directory and the mountpoint
The following commands allow you to 1) load the gocryptfs module, 2) create the two directories data-encrypted and data-plain, and 3) initialize the encrypted directory:
module load gocryptfs
mkdir data-encrypted data-plain
gocryptfs -init data-encrypted
The last command will prompt you to choose a password and will provide you with a master key, e.g.:
[user@r07m01 ~]$ gocryptfs -init data-encrypted
Choose a password for protecting your files.
Password:
Repeat:
Your master key is:
1a88a6b1-8f072fe8-7aac5356-1d025115-
7574f7c3-627cbbdb-12b96ca8-09bfb39a
If the gocryptfs.conf file becomes corrupted or you ever forget your password,
there is only one hope for recovery: The master key. Print it to a piece of
paper and store it in a drawer. This message is only printed once.
The gocryptfs filesystem has been created successfully.
Interactive usage
Once you have set up your encrypted storage you can interactively start it as follows:
module load gocryptfs
gocryptfs data-encrypted data-plain
The last command will prompt you for the password you chose earlier:
[user@r07m01 ~]$ gocryptfs data-encrypted data-plain
Password:
Decrypting master key
Filesystem mounted and ready.
At this point, you can use data-plain just like any other directory, e.g. to copy data to it:
cp datafile.csv data-plain/
In data-plain, the file will show up as datafile.csv, but it will actually be stored in
data-encrypted under an encrypted file name.
Remember to stop gocryptfs when you are done by unmounting the userspace filesystem:
fusermount -u data-plain
After this, data-plain is empty again: the data only lives in data-encrypted.
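If you are unsure whether a gocryptfs mount from an earlier session is still active, the mountpoint tool (part of util-linux, available on most Linux systems) can check before you try to unmount; a small sketch:

```shell
#!/bin/bash
# Unmount data-plain only if a filesystem is actually mounted there;
# this avoids errors from fusermount on an already-unmounted directory.
if mountpoint -q data-plain; then
    fusermount -u data-plain
else
    echo "data-plain is not mounted"
fi
```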
Usage in a batch job
1. Store your password in an encrypted environment variable
This step safely distributes the password to each compute node. One way to do this is by using munge, the authentication service used by Slurm.
export GOCRYPTFSPASSWORD=$(mask-passwd | munge -i /dev/stdin)
Please note that MUNGE relies on a shared secret between the nodes. Anyone with knowledge of this secret (i.e., the system administrators) can decrypt MUNGE-encrypted information. The MUNGE-encrypted password should thus still be treated as a sensitive piece of information and should not be written to a file on the shared filesystem. Instead, we will pass it to the batch job via an environment variable, so it is only stored in the Slurm database on one of the management servers.
This procedure needs to be repeated for every new session before submitting a job that requires gocryptfs decryption.
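The mask-passwd helper above reads the password without echoing it to the terminal. Should it be unavailable on your system, plain bash can do the same with read -s; a minimal sketch, assuming an interactive shell and the same munge pipeline as above:

```shell
#!/bin/bash
# Read the gocryptfs password without echoing it to the terminal,
# then MUNGE-encrypt it into the variable passed to the batch job.
read -r -s -p "gocryptfs password: " pw; echo
export GOCRYPTFSPASSWORD=$(printf '%s' "$pw" | munge -i /dev/stdin)
unset pw
```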
2. Create your sbatch script
In order to ensure that no decrypted data remains once the job is terminated, we create the data-plain directory in the temporary storage ($TMPDIR) of the Slurm job, as it will be unmounted and erased at the job's conclusion.
This also means that your scripts have to use the path $TMPDIR/data-plain to access the data.
#!/bin/bash
# ... various SBATCH requirements ...
module load gocryptfs
mkdir "$TMPDIR/data-plain"
gocryptfs -passfile=<(echo "$GOCRYPTFSPASSWORD" | unmunge -i /dev/stdin -m /dev/null) \
    data-encrypted "$TMPDIR/data-plain"
# do your work here
# Finally, unmount the decrypted filesystem
fusermount -u "$TMPDIR/data-plain"
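The -passfile=<(...) construct in the script relies on bash process substitution: the output of the inner pipeline appears to gocryptfs as a short-lived file containing the password, so the plaintext never touches the filesystem. A stand-alone illustration, with cat standing in for gocryptfs and a literal string for the unmunge pipeline:

```shell
#!/bin/bash
# Process substitution: <(cmd) expands to a path like /dev/fd/63 from
# which cmd's output can be read. cat plays the role of gocryptfs here.
PASSWORD="correct-horse"
cat <(printf '%s\n' "$PASSWORD")
# prints: correct-horse
```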
Additional sbatch flags
Security can be slightly increased by running your job with the --exclusive flag to prevent other jobs from running on the same node while the data is decrypted.
Note that this is quite a costly measure in terms of resource allocation for a small increase in security: it merely ensures that other (non-root) users cannot read the content of the decrypted folder on the compute node.
Additional gocryptfs flags
The -sharedstorage option is necessary whenever more than one host accesses the encrypted storage at the same time, e.g. in a multi-node job, in simultaneous single-node jobs using the same data, or when using gocryptfs on the login node to access the data while a job is running.
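Applied to the batch script above, the mount line would then read (a sketch; all other flags as in the main text):

```shell
# Mount with -sharedstorage so several hosts can safely use the same
# encrypted directory concurrently (at some performance cost).
gocryptfs -sharedstorage \
    -passfile=<(echo "$GOCRYPTFSPASSWORD" | unmunge -i /dev/stdin -m /dev/null) \
    data-encrypted "$TMPDIR/data-plain"
```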
Note that you cannot use --chdir=path/to/data-plain in your sbatch script, because that directory does not yet provide a view of the encrypted data at job start time.
You can cd path/to/data-plain after mounting data-plain, but remember to change back into another directory before unmounting.
After the job runs
Even though the data-plain mountpoint is erased after the job is done, you can still access your data, and the results of your computations, using the instructions under "Interactive usage".