Datalad
Datalad is a Python framework that organizes data in much the same way git manages code: you can manage your data in a decentralized way, with version control. It also lets you do so reproducibly, i.e. Datalad records what you did to alter your dataset.
This page does not aim to give a full introduction to or documentation of Datalad; that is best obtained from the authors themselves on the handbook page of Datalad. The handbook is specifically tailored to beginners, so that even if you have never used git before, you will be able to follow it. It is a valuable resource you should check out when starting to work with Datalad. Large parts of this page are a condensed version of the tutorials given there, with a special focus on the use case on PALMA. The general idea is simple: all data is either primary data (say, samples you took) or secondary data (data derived from your samples, say, by processing them with a script).
Datalad uses git and git-annex in order to manage data with version control. However, it cleans up the user experience of these tools a little and provides some custom features that make it easier to work with. Furthermore, Datalad can automatically record the commands you use to change the dataset and can re-do them, if you need to. This feature also enables others to view the changes you made in a straightforward way.
While git is a household name, git-annex may need some introduction even for well-versed command-line users; it is not to be confused with git-lfs. It enables git to handle large and binary files (such as typical data files). It does so by tracking only a link to the actual file content in git, while the content itself is placed in a hidden directory. Git-annex can upload and download data from a wide range of back-ends, and Datalad implements some of this functionality directly.
Use case
There are multiple use cases in which Datalad can be helpful:
- As it tracks the alterations to your dataset, even after some time, you can make sure that you remember how the data you generated came to be.
- You can publish or share your data with your collaborators using git (e.g. the GitLab instance hosted at git.nrw or similar platforms). Note that, especially for large datasets, you should not use git hosting services like git.nrw to store the actual data. Instead, you can use one of the many special remotes supported by git-annex, such as an S3 bucket or even Sciebo.
- Datalad allows you to download only the data you actually need, meaning that you have a good overview of your data at all times while keeping a small footprint when the entire dataset is not needed. This is particularly useful when you run large-scale calculations on a cluster like PALMA and need to follow them up locally on your laptop.
- You can easily push your data to different storage providers, so you do not need to rely on one storage solution (see the git-annex site for the supported protocols). This is particularly handy for scientists, as you can easily off-load your data when you are changing positions or similar.
- It can ease data migration between clusters as you can automatically sync your data from an external data source instead of having to manually sync between clusters.
Basic usage
In the handbook, you can find how to set up a new repository using the
datalad create <my repository>
command. For git users, this is about the same as git init <my repository>.
For ease of use, Datalad lets you add and commit changes to your dataset in one command, namely
datalad save <path>
If no path is given, the entire dataset is saved.
We will skip everything that has to do with siblings (git and git-annex remotes) and only note that, in contrast to git, Datalad knows three types of remotes: pure git remotes, remotes with an attached storage, and pure storage remotes.
To make sense of the usage on the HPC system, we also note the following special command:
datalad run -m "My message" -i <input files> -o <output files> "<command>"
This executes the given command in the dataset you are currently in, saves the changes, and records what you did to change the dataset. Datalad accomplishes this by writing the command in a specific format into a git commit. The documentation page may give you more information; a proper tutorial can be found here. Apart from this, there are the
datalad push
command to push a file to a remote and the
datalad get <file>
command that gets the content of a file from a storage sibling.
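The commands of this section can be strung together into a small trial run. The following is only a sketch under stated assumptions: the helper names dl and basic_cycle, the file names samples.txt and sorted.txt, and the DRY_RUN switch are made up for illustration and are not part of Datalad. With DRY_RUN=1 the Datalad calls are only printed, so the sequence can be inspected on a machine without Datalad.

```shell
# dl: call datalad, or only print the call if DRY_RUN is set (hypothetical helper)
dl() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "datalad $*"
    else
        datalad "$@"
    fi
}

# basic_cycle DIR: create a dataset in DIR, save one primary-data file and
# derive a second file from it with a recorded command (hypothetical helper).
# The body runs in a subshell, so the cd does not affect the calling shell.
basic_cycle() (
    set -e
    mkdir -p "$1"
    dl create "$1"                       # datalad create accepts an empty directory
    cd "$1"
    printf '3\n1\n2\n' > samples.txt     # stand-in for primary data
    dl save -m "Add primary data" samples.txt
    # secondary data: the command is recorded together with its input and output
    dl run -m "Sort samples" -i samples.txt -o sorted.txt "sort samples.txt > sorted.txt"
)
```

Calling basic_cycle with a directory of your choice runs the sequence for real. Afterwards, git log inside the dataset shows the run record, and datalad rerun repeats the recorded command.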
Usage on PALMA
Datalad is installed on PALMA via EasyBuild. You can load it in software stacks from palma/2022b upwards via
module load palma/xxxx
module load GCCcore
module load datalad
There are a few things to note about using Datalad on the cluster. The basic, interactive usage is the same as on your local computer; running Datalad in SLURM jobs, however, needs some attention.
Running jobs with Datalad
As mentioned in the HPC section of the Datalad Handbook, the command datalad run does not work well with parallel jobs (i.e. two SLURM jobs in the same dataset).
If we want to run jobs with Datalad, we need to take care of this so that the commands from our job script also remain reproducible.
There are two workflows to consider, which are explained in the next two sections. Briefly comparing the two, we find:
- Single repository workflow
  - Pros:
    - only small changes to your normal workflow required
    - faster to set up
    - compute nodes do not have to make redundant copies of your datasets, meaning less time spent before the real calculation starts
  - Cons:
    - can be unstable when running multiple jobs in the same dataset (~10 was the maximum in experience)
    - can lead to less clean data environments, unfinished jobs are not discarded
- Temporary repository workflow
  - Pros:
    - clean setup in regard to how git is used
    - can run many parallel jobs at once (~100 parallel jobs in one dataset at the same time were seen to not lead to issues)
  - Cons:
    - harder to set up
    - could lead to quota problems if set up with scratch as a temporary directory
    - need to consolidate data in an extra step
    - compute nodes receive overhead, as they need to copy the data to a temporary location
Single repository workflow
The first (rather simple) workflow uses the --explicit option of datalad run. The datalad save that is performed at the end of the run then applies only to the output files given in the arguments of the command.
This has the big advantage of not needing to create new clones of the repository, as in the next example; however, one has to take care of keeping one's own repository clean. This workflow was found to be more performant, but it can lead to issues when a job does not finish for any reason, as the commit is then not made and we are left with "unsaved" work in our dataset. (This may, however, be beneficial if you specifically need your job to finish in order to process its results further.) An example job script is given by:
#!/bin/bash
##### SETUP SLURM OPTIONS AS USUAL
#SBATCH --nodes=...
#SBATCH --ntasks=...
#SBATCH --partition=...
#SBATCH --time=...
#SBATCH --mail-type=...
#SBATCH --mail-user=...
#SBATCH --job-name=...
#SBATCH -o ...
...
# load modules
module load palma/xxxx
module load GCCcore
module load datalad
module load ...
# define variables needed for job
...
# execute
datalad run -m "My message" -i <input files> -o <output files> --explicit "<your normal job command>"
This workflow can have issues with multiple jobs running in the same dataset, as only one job at a time is allowed to save changes to the dataset. This usually becomes a problem if you have 15+ jobs wanting to save at the same time.
You can attempt to circumvent this issue by using flock to wait until the command currently blocking the save is done. From experience, this has indeed eased the problem in some cases.
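One way to apply flock is a small wrapper around the step that saves to the dataset. The sketch below is an assumption about how this could look, not a tested PALMA recipe: with_lock, the lock file location, and the default waiting time of 600 seconds are hypothetical choices.

```shell
# with_lock LOCKFILE COMMAND [ARGS...]
# Run COMMAND while holding an exclusive lock on LOCKFILE (hypothetical helper).
# Waits at most LOCK_TIMEOUT seconds (assumed default: 600) for other jobs
# to release the lock and fails if the lock could not be obtained in time.
with_lock() {
    lockfile=$1
    shift
    flock -w "${LOCK_TIMEOUT:-600}" "$lockfile" "$@"
}
```

In a job script this could be used as with_lock <lock file> datalad save -m "My message" <output files> once the computation has finished; note that a plain datalad save does not store the command in the commit the way datalad run does. Wrapping the complete datalad run call in the lock keeps the run record, but the lock is then held for the whole computation, so the jobs would effectively run one after another.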
Temporary repository workflow
Alternatively, one can use the following workflow, which is based on this article and has a few advantages over the original workflow suggested in the Datalad handbook. In particular, it keeps the cost of cloning the dataset to a minimum, as we do not need to clone the output data, but instead only the input-dataset.
The idea is to do the calculations in temporary copies of your dataset. The calculated data is pushed to a safe place and the temporary copy is deleted. To do so, we set up the following structure, in which we basically have three stages of the data:
- input & preparation (interactive, on the headnode)
- calculation phase (your SLURM job)
- completed calculation (input data + the new files coming from your calculations)
To implement this structure, we first create a dataset that contains only the input files (this includes scripts, job scripts, binaries etc., i.e. everything needed to make the job work).
From this dataset, we can start the jobs on the cluster.
Then, we set up another dataset using the datalad create-sibling command (see also the documentation).
This is where we collect all data at the end of the jobs. Additionally, we set up a second sink sibling, e.g. a RIA store, which has its own command. The advantage is that we can push all data to the RIA store, while the sink sibling only keeps track of the changes (which is comparatively quick, whereas transferring the data to another place is comparatively slow).
You can set up the needed repositories up to this point with the following commands:
# create input repo
datalad create <nameofrepository>
cd $WORK
# create first output repo
datalad clone <path/to/repository> ./<nameofrepository>_sink
cd $HOME/<nameofrepository>
# make sure the input repo knows about the output
datalad siblings add -s sink $WORK/<nameofrepository>_sink
# create ria-store sibling to push the data to
datalad create-sibling-ria -s ria --new-store-ok --alias <nameofrepository> --shared group --group <mymaingroup> -R 0 ria+file:$WORK/my_ria
Keep in mind that this is only a suggested configuration, and it is up to you how to implement it for your workflow. Now you can fill the input repo with the data needed for your calculations. With Datalad, it is possible to use submodules, as in git. Therefore, if you have large datasets that are needed for input, it is suggested to use a sub-dataset, as you do not need to copy large amounts of data in this case.
Finally, we alter the SLURM script in such a way that it first copies the data needed for the job to a temporary location, executes the job, and pushes the results to our sink siblings. The basic steps for this are:
- SBATCH options
- set up working directory
- set up separate datalad dataset
- run script
- push results to output datasets
Example:
#!/bin/bash
##### SETUP SLURM OPTIONS AS USUAL
#SBATCH --nodes=...
#SBATCH --ntasks=...
#SBATCH --partition=...
#SBATCH --time=...
#SBATCH --mail-type=...
#SBATCH --mail-user=...
#SBATCH --job-name=...
#SBATCH -o ...
...
# pick source repo and temporary working directory
MYSOURCE=<input repository>
MYSINK=<sink repository>
WDIR=/tmp/slurm_${SLURM_JOB_USER}.${SLURM_JOB_ID}
# load modules
module load palma/xxxx
module load GCCcore
module load datalad
module load ...
# clone the datalad dataset into the working directory
cd ${WDIR}
# flock is used here for the case, that two jobs start at the same time
flock ${MYSOURCE}.lock datalad clone ${MYSOURCE} ds_${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}
cd ds_${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}
# prepare dataset for the job, checkout into new branch
git annex dead here
git checkout -b job_${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}
# initialize any additional variables needed
outdir=...
# run the actual job
datalad run -i "specify one input here" -i "and the next one here" -o "same for outputs" "<PROGRAM GOES HERE>"
# if your job consists of multiple programs, you should consider placing them in separate datalad run commands.
datalad push --to ria
flock ${MYSINK}.lock git push sink job_${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}
In the case above, we are using /tmp/slurm_${SLURM_JOB_USER}.${SLURM_JOB_ID} as our temporary file system.
Be careful: while SLURM makes this location available on all nodes that you use, the directories are not shared (i.e. r12n03 and r12n04 have different temporary directories).
If your program reads data on multiple MPI ranks that are spread over the reserved nodes, you can clone your repository to all of them:
# Get the data to all nodes
srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES --tasks-per-node=1 datalad clone ${MYSOURCE} ds_${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}
The changes would, however, still only be written to one output. This is not a problem if all output is written by one MPI rank, but it can lead to problems if you are using parallel writing libraries (an example would be the parallel implementation of HDF5). An alternative would be to create your own temporary directory. In that case, you need to take care of dropping the data at the end of the script yourself. This is most easily done with
datalad drop *
cd ..
rm -rf ds_${SLURM_JOB_ID}_${SLURM_ARRAY_TASK_ID}
at the end of the script.
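Commands at the end of a script are skipped if the script exits early, for example because a step failed. A possible way around this is a shell trap that runs whenever the script exits. The sketch below assumes that your own temporary directory lives below a base directory that you choose; make_workdir and cleanup_workdir are hypothetical helper names.

```shell
# make_workdir BASE: create a private directory below BASE and print its path
make_workdir() {
    mktemp -d "${1%/}/datalad_job.XXXXXX"
}

# cleanup_workdir DIR: delete DIR including a dataset clone inside it.
# git-annex removes write permission from the content it stores, so the
# tree is made writable before it is deleted.
cleanup_workdir() {
    if [ -d "$1" ]; then
        chmod -R u+w "$1"
        rm -rf "$1"
    fi
}

# Possible use at the top of a job script (the base directory is an assumption):
#   WDIR=$(make_workdir "$WORK")
#   trap 'cleanup_workdir "$WDIR"' EXIT
#   cd "$WDIR"
```

Because the trap also fires after a successful run, the results have to be pushed before the script ends, as in the example job script.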
When specifying the inputs, and especially the outputs, please specify all files needed, but not anything else. Be careful with wildcards, like "*"!
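The reason for the warning about wildcards is that an unquoted pattern is expanded by the shell before Datalad ever sees it. The following sketch makes this visible without Datalad; show_args and glob_demo are hypothetical helpers and the file names are made up.

```shell
# show_args: print each argument on its own line, the way a program receives them
show_args() {
    for arg in "$@"; do
        printf '[%s]\n' "$arg"
    done
}

# glob_demo: compare an unquoted and a quoted pattern in a scratch directory
glob_demo() (
    cd "$(mktemp -d)" || exit 1
    touch a.dat b.dat c.dat
    echo "unquoted:"
    show_args -o *.dat      # the shell passes -o, a.dat, b.dat, c.dat
    echo "quoted:"
    show_args -o "*.dat"    # the pattern is passed on unchanged as *.dat
)
```

For datalad run this means that -o *.dat would be read as -o a.dat b.dat c.dat, so only a.dat is treated as an output and the remaining names are taken as the beginning of the command. Quoting the pattern hands it to Datalad unchanged, which then resolves it itself.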
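After the jobs have finished, the job branches that were pushed to the sink have to be merged there. This can be done with a script like the following:
#!/bin/bash
# Job ID whose branches should be merged (first argument)
JOB_ID=${1:?usage: $0 <jobID>}
# Collect all branches belonging to this job
BRANCHES=$(git branch --list "job_${JOB_ID}_*" --format='%(refname:short)')
# Merge all branches into the currently checked-out branch (e.g. master)
git merge -m "Merge results from job ${JOB_ID}" ${BRANCHES}
# Delete the branches
git branch -d ${BRANCHES}
This script is to be executed with the job ID as a parameter, e.g. it can be called as myScript <jobID>.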
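Before running such a merge on real results, the logic can be tried in a throwaway git repository. The sketch below is illustrative only: merge_job_branches, demo_merge, the job ID 42, and the file names are hypothetical, and plain git is used instead of a Datalad dataset.

```shell
# merge_job_branches JOB_ID: merge all local branches named job_<JOB_ID>_* into
# the current branch and delete them afterwards (same idea as the script above)
merge_job_branches() {
    branches=$(git branch --list "job_${1}_*" --format='%(refname:short)')
    [ -n "$branches" ] || { echo "no branches for job $1" >&2; return 1; }
    # word splitting of $branches is intended: one argument per branch
    git merge -m "Merge results from job $1" $branches &&
        git branch -d $branches
}

# demo_merge: build a scratch repository with two job branches and merge them
demo_merge() (
    set -e
    cd "$(mktemp -d)"
    git init -q .
    git config user.name "Demo User"
    git config user.email "demo@example.org"
    git commit -q --allow-empty -m "Initial commit"
    base=$(git rev-parse --abbrev-ref HEAD)
    for task in 1 2; do
        git checkout -q -b "job_42_${task}" "$base"
        echo "result ${task}" > "out_${task}.txt"
        git add "out_${task}.txt"
        git commit -q -m "Results of task ${task}"
    done
    git checkout -q "$base"
    merge_job_branches 42 > /dev/null
    ls    # both result files are now present on the base branch
)
```

Because each job writes its own files, the branches merge without conflicts; jobs that modify the same file would need manual conflict resolution at this point.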
Using datalad slurm-schedule
This paper introduces an extension to Datalad called datalad-slurm, which circumvents the problems described above by using a local database.
For now, this project is untested on PALMA; however, the description seems promising. You can test it yourself by installing it with the following commands:
git clone https://github.com/knuedd/datalad-slurm.git
cd datalad-slurm
pip install --user -e .
Test if the package is there with
datalad slurm-schedule --help
This also gives you the basic commands and how they are used. Remember that this package is not fully stable yet (Sep. 2025) and the interface might change.