Data & Storage
Storage at PALMA has to be fast enough to handle data transfers from several hundred nodes at the same time and is therefore quite expensive. Consequently, the user's scratch folder on PALMA is a place to handle and store simulation results only for a limited amount of time. It is not an archive! We therefore ask you to remove old data from your scratch folder as soon as possible.
The following table gives an overview of the storage types available on PALMA:
| PATH | Purpose | Available on | Protection against data loss |
|---|---|---|---|
| /home/[a-z]/<username> | Storage for scripts, binaries, applications etc. | Login and Compute Nodes | 21 nightly snapshots to recover accidentally deleted files. No backup. |
| /scratch/tmp/<username> | Temporary storage for simulation input and results | Login and Compute Nodes | No backup, no snapshots. |
| /cloud/wwu1/<projectname>/<sharename> | Cloud storage solution / archive | Login and Compute Nodes | No backup, no snapshots. |
| /tmp/slurm_${USER}.${SLURM_JOBID} | Local node storage | Compute Nodes | Only exists for the duration of the job. |
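The local node storage in the last row is useful for I/O-heavy jobs: staging data to the node-local disk avoids hammering the shared scratch filesystem. A minimal sketch of the stage-in / compute / stage-out pattern (all file names are illustrative; inside a real Slurm job on PALMA, $TMPDIR already points at the job-local directory, and the fallback below only makes the sketch runnable elsewhere):

```shell
# Stage-in / compute / stage-out pattern using local node storage.
# Inside a Slurm job, $TMPDIR is already set for you; the fallback
# below only makes this sketch runnable outside a job.
TMPDIR=${TMPDIR:-$(mktemp -d)}

mkdir -p input
echo "params" > input/config.txt      # illustrative input data

cp -r input "$TMPDIR"/                # stage input onto the fast local disk
echo "result" > "$TMPDIR/out.dat"     # placeholder for your actual computation
cp "$TMPDIR/out.dat" .                # stage results back (to scratch on PALMA)
```

Because the job-local /tmp directory is deleted when the job ends, anything you want to keep must be copied back before the script exits.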
Recommended data storage policy
The recommended way to use the different storage types is the following:
- Put your own programs, scripts, etc. in home.
- We strongly suggest using Git in combination with the university's GitLab for versioning.
- During all of your simulations, write data only to scratch.
- If you do not need your data on scratch any longer, delete it, copy it to local facilities or create a project in the cloud and archive your data there.
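For the last step, bundling many result files into a single archive before moving them off scratch keeps both the transfer time and your file-count quota down. A sketch of this (the directory names are illustrative, and the final mv target would be your own cloud share):

```shell
# Illustrative: a small results directory standing in for real output.
mkdir -p demo_project/results
echo "final numbers" > demo_project/results/run1.dat

# Bundle everything into a single compressed archive.
tar -czf demo_project_results.tar.gz demo_project/results

# On PALMA you would then move the archive off scratch, e.g.:
# mv demo_project_results.tar.gz /cloud/wwu1/<projectname>/<sharename>/
```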
Quota limits
Both the number of files and the used storage capacity are subject to limits on PALMA, with different quotas for home and scratch:
| PATH | Storage Quota | File Limit Quota |
|---|---|---|
| /home/[a-z]/<username> | 25 GB | 200,000 (200K) files |
| /scratch/tmp/<username> | 5 TB | 1,000,000 (1M) files |
You can check your current status by calling myquota from a terminal on PALMA.
The quota limits are chosen as a compromise between fulfilling our users' computational needs on the one hand and keeping the system running on the other. Accordingly, quota extensions are discouraged, and we invite you to try the workarounds detailed below before asking for an extension.
File limit (number of files > 1M)
Some use cases require a great number of files, either because the raw data to be processed is very fragmented or because the software used requires a lot of input files. We provide a viable solution for each of these two cases.
Input data composed of too many files
One likely situation is that the input data of your analysis pipeline is composed of many small files (e.g. gene sequences). If this data only needs to be read during the computation, there is the possibility of creating a SquashFS, i.e. a read-only archive of files that can be mounted as a filesystem. This can be done by first creating the filesystem:
```shell
mksquashfs mydata/ mydata.squashfs
```
Now your data will be squashed into the single file mydata.squashfs (similar to how ZIP or TAR archives work). If your data folder is already too big for scratch, you can create the SquashFS directly from your cloud directory. Otherwise, the data folder can be tar-ed or moved to the cloud to free up your quota again.
Notably, you cannot access mydata.squashfs without mounting it first (i.e. hooking the SquashFS somewhere into the filesystem).
Currently, the filesystem cannot be mounted on your scratch or home directories. For this reason, if you want to access the filesystem from a session on the login node, you should mount it under /tmp/your-mountpoint-name. Please note that:
- Everybody on the login node can access the /tmp folder and can thus see your mounted filesystem.
- This is a temporary solution; we will update this documentation accordingly once creating mountpoints on scratch and home is allowed.
The alternative and suggested solution is to mount the filesystem when you run the computation itself. In this way you use the /tmp filesystem of the compute node you are on (available via the environment variable $TMPDIR), which is cleaned up automatically after your job ends, including the mountpoint.
In a nutshell, your batch script should look like this:
```shell
#!/bin/bash
#SBATCH preambles...
# ...

# Create mountpoint and mount SquashFS
mkdir $TMPDIR/mountpoint
squashfuse /path/to/mydata.squashfs $TMPDIR/mountpoint

# ... run your computation here ...

# Unmount SquashFS to clean up
fusermount -u $TMPDIR/mountpoint
```
The mounting and unmounting steps can of course be repeated if you have multiple datasets.
Software composed of too many files
It can happen that some software (and, more often, its dependencies) hits the limit of 1M files. In this case, one viable solution is to create a container. On PALMA we offer Apptainer (formerly Singularity) as a tool to handle containers without root access. You can find excellent documentation on how to build and run your own containers in the Apptainer user guide. If you need assistance building your own container, you can contact us at hpc@uni-muenster.de or on Mattermost.
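As an illustration, a minimal Apptainer definition file might look like the following (the base image and the installed package are placeholders, not a recommendation):

```
Bootstrap: docker
From: python:3.11-slim

%post
    # All dependency files end up inside the image instead of on scratch.
    pip install numpy

%runscript
    exec python "$@"
```

Building it with apptainer build myenv.sif myenv.def produces a single .sif file, so the possibly hundreds of thousands of files inside the container count as only one file toward your quota.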
Storage limit (data > 5 TB)
If your input data exceeds the 5 TB threshold, please have a look at the next subsection, Cloud Storage.
Cloud Storage
If transferring data to your local machine is not feasible and you are running out of space on scratch, take a look at the Uni Cloud storage options. This OpenStack-based cloud system can provide so-called usershares (disk space in the form of a network share) which you can access directly from PALMA but also from your local machine. To do this, take the following steps:
- Apply for a cloud project if your group does not have one yet.
- Once your project is available, create a usershare (follow the steps as explained here):
    - Log in at openstack.wwu.de
    - Go to Share → Shares and click on Create Share
    - Fill in the appropriate information (Name, Protocol, Size, ...)
- Your usershare will usually be created within one minute and can be accessed from the login node at: /cloud/wwu1/<projectname>/<sharename>
- Transfer data to and from your scratch folder on PALMA and your cloud usershare folder.