Slurm: BeeOND

BeeOND is integrated into the prolog and epilog scripts of the batch system. It can be used on the compute nodes during the job runtime by requesting it with the constraint flag "BEEOND", "BEEOND_4MDS" or "BEEOND_MAXMDS":

  • BEEOND: one metadata server is started on the first node.
  • BEEOND_4MDS: 4 metadata servers are started within your job. If your job has fewer than 4 nodes, correspondingly fewer metadata servers are started.
  • BEEOND_MAXMDS: a metadata server for the on-demand file system is started on every node of your job.

As a starting point we recommend the "BEEOND" option. If you are unsure whether this is sufficient for your use case, feel free to contact the support team.

#!/bin/bash
#SBATCH ...
#SBATCH --constraint=BEEOND   # BEEOND_4MDS or BEEOND_MAXMDS

After your job has started, you can find the private on-demand file system in /mnt/odfs/$SLURM_JOB_ID directory. The mountpoint comes with five pre-configured directories:

#for small files (stripe count = 1)
/mnt/odfs/$SLURM_JOB_ID/stripe_1
#default (stripe count = 4)
/mnt/odfs/$SLURM_JOB_ID/stripe_default or /mnt/odfs/$SLURM_JOB_ID/stripe_4
#stripe count = 8, 16 or 32; use these directories for medium-sized and large files or when using MPI-IO
/mnt/odfs/$SLURM_JOB_ID/stripe_8, /mnt/odfs/$SLURM_JOB_ID/stripe_16 or /mnt/odfs/$SLURM_JOB_ID/stripe_32

If you request fewer nodes than the stripe count, the stripe count is reduced to the number of nodes, e.g. if you request 8 nodes, the directory with stripe count 16 will effectively use a stripe count of 8.
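The capping rule above can be sketched as a small shell helper (the function name and the hard-coded example values are illustrative only; inside a running job you would use $SLURM_JOB_NUM_NODES for the node count):

```shell
# Illustrative sketch: the effective stripe count is the smaller of the
# requested stripe count and the number of nodes in the job.
effective_stripe() {
    local requested=$1 nodes=$2
    if [ "$requested" -gt "$nodes" ]; then
        echo "$nodes"
    else
        echo "$requested"
    fi
}

effective_stripe 16 8   # prints 8: only 8 nodes, so striping is capped at 8
effective_stripe 4 8    # prints 4: enough nodes, requested stripe count is used
```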

The capacity of the private file system depends on the number of nodes. For each node you will get 750 GByte.

Be careful when creating large files:

Always use the directory with the greatest stripe count for large files. For example, if your largest file is 3.1 TB, you must use a stripe count greater than 4, since 4 × 750 GB = 3 TB is not enough.

If you request 100 nodes for your job, your private file system has a capacity of 100 × 750 GB ≈ 75 TB.
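The capacity and minimum-stripe arithmetic from the paragraphs above can be written out as a small shell sketch (the variable names are illustrative; 750 GB per node is the figure stated in this document):

```shell
# Per-node contribution to the on-demand file system, in GB (from this page).
GB_PER_NODE=750

# Total capacity for a 100-node job.
nodes=100
capacity_gb=$((nodes * GB_PER_NODE))
echo "capacity: ${capacity_gb} GB"          # 75000 GB, i.e. about 75 TB

# Minimum stripe count so that a single large file fits:
# ceil(file_size / 750 GB), here for a 3.1 TB (3100 GB) file.
file_gb=3100
min_stripe=$(( (file_gb + GB_PER_NODE - 1) / GB_PER_NODE ))
echo "minimum stripe count: ${min_stripe}"  # 5, so pick the stripe_8 directory
```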

Recommendation:

The private file system uses its own metadata server, which is started on the first node of your job. Depending on your application, the metadata server can consume a considerable amount of CPU power. Adding an extra node to your job may therefore improve the usability of the on-demand file system. Start your application with the MPI option:

mpirun -nolocal myapplication
With the -nolocal option, the node where mpirun is initiated is not used for your application. This node is then fully available to the metadata server of the requested on-demand file system.

Example job script:

#!/bin/bash
#very simple example on how to use a private on-demand file system
#SBATCH -N 10
#SBATCH --constraint=BEEOND

#create a workspace for the results
ws_allocate myresults-$SLURM_JOB_ID 90
RESULTDIR=$(ws_find myresults-$SLURM_JOB_ID)

#set ENV variable to the on-demand file system
ODFSDIR=/mnt/odfs/$SLURM_JOB_ID/stripe_16

#start application and write results to on-demand file system
mpirun -nolocal myapplication -o $ODFSDIR/results

#copy the results back after your application ends
rsync -av $ODFSDIR/results $RESULTDIR

Last update: July 22, 2022