Containers on HPC systems
To date, only a few container runtimes integrate well with HPC environments, owing to security concerns and differing assumptions in some areas.
For example, native Docker environments require elevated privileges, which is not an option on shared HPC resources. Docker's "rootless mode" is also currently not supported on our HPC systems, because it lacks necessary features such as cgroups resource controls, security profiles and overlay networks; furthermore, GPU passthrough is difficult. The subuid (newuidmap) and subgid (newgidmap) settings it requires may also raise security issues.
Other rootless container runtimes (Podman, …) might be supported in the future, depending on how their support for e.g. network interconnects, security features and HPC file systems develops.
Enroot
Enroot enables you to run Docker containers on HPC systems. It is developed by NVIDIA, is the recommended tool for using containers on HoreKa and integrates well with GPU usage.
Enroot is available to all users by default.
NVIDIA provides excellent documentation on its GitHub page, so this page confines itself to simple examples that introduce the essential functionality.
Using Docker containers with Enroot requires three steps:
- Importing an image
- Creating a container
- Starting a container
Optionally, containers can also be exported and transferred.
Importing a container image
$ enroot import docker://alpine
This pulls the latest Alpine image from Docker Hub (the default registry). You will obtain the file
$ enroot import docker://nvcr.io#nvidia/pytorch:21.04-py3
This pulls the PyTorch image with the tag 21.04-py3 from NVIDIA's NGC registry. You will obtain the file
$ enroot import docker://registry.scc.kit.edu#myProject/myImage:latest
This pulls the latest version of your image from the KIT registry. You will obtain the file
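NGC images such as the PyTorch one are several gigabytes in size, so it can pay off to skip the download when the squashfs file already exists. The following is a minimal sketch, assuming bash; the function name import_image is hypothetical, and the output file name is set explicitly with the --output option (also used further below) instead of relying on enroot's default naming.

```shell
# import_image URI OUTPUT
# Hypothetical helper: import URI into the squashfs file OUTPUT,
# unless OUTPUT already exists (avoids re-downloading large images).
import_image() {
    local uri="$1" out="$2"
    if [ -e "$out" ]; then
        echo "$out already exists, skipping import"
        return 0
    fi
    # --output sets the name of the resulting .sqsh file
    enroot import --output "$out" "$uri"
}
```

Usage, e.g.: import_image docker://nvcr.io#nvidia/pytorch:21.04-py3 nvidia+pytorch+21.04-py3.sqsh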
Creating a container
$ enroot create --name nvidia+pytorch+21.04-py3 nvidia+pytorch+21.04-py3.sqsh
Create a container named nvidia+pytorch+21.04-py3 by unpacking the image file nvidia+pytorch+21.04-py3.sqsh.
Creating a container means that the squashed container image is unpacked inside $ENROOT_DATA_PATH/. By default, this variable points to $HOME/.local/share/enroot/. If quota restrictions apply to your home directory, $ENROOT_DATA_PATH can be set to point to a workspace instead.
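To check how much space an unpacked container occupies, its directory can be derived from the variable. A minimal sketch, assuming the default data path stated above applies whenever ENROOT_DATA_PATH is unset; container_rootfs is a hypothetical helper name.

```shell
# container_rootfs NAME
# Hypothetical helper: print the directory into which `enroot create --name NAME`
# unpacks the root file system, falling back to the default data path.
container_rootfs() {
    echo "${ENROOT_DATA_PATH:-$HOME/.local/share/enroot}/$1"
}
```

Usage, e.g.: du -sh "$(container_rootfs nvidia+pytorch+21.04-py3)"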
Starting a container
$ enroot start --rw nvidia+pytorch+21.04-py3 bash
Start the container nvidia+pytorch+21.04-py3 in read-write mode (--rw) and run bash inside the container.
$ enroot start --root --rw nvidia+pytorch+21.04-py3 bash
Start the container in read-write mode (--rw) and get root access (--root) inside the container.
You can now install software with root privileges using the package manager of the containerized Linux distribution, e.g. apt-get install …, apk add …, yum install … or pacman -S …
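Packages can also be installed non-interactively by passing the command directly to enroot start. The following is a sketch under the assumption that the ID field of /etc/os-release identifies the distribution; the helper install_cmd_for and its ID-to-command mapping are hypothetical and cover only the package managers listed above.

```shell
# install_cmd_for ID
# Hypothetical helper: map the ID field of /etc/os-release to a
# non-interactive install command prefix; package names are appended.
install_cmd_for() {
    case "$1" in
        ubuntu|debian)      echo "apt-get update && apt-get install -y" ;;
        alpine)             echo "apk add --no-cache" ;;
        centos|rhel|fedora) echo "yum install -y" ;;
        arch)               echo "pacman -Sy --noconfirm" ;;
        *)                  echo "unsupported distribution: $1" >&2; return 1 ;;
    esac
}
```

Usage, e.g. for the Ubuntu-based NGC PyTorch image: enroot start --root --rw nvidia+pytorch+21.04-py3 sh -c "$(install_cmd_for ubuntu) vim"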
$ enroot start -m <localDir>:/work --rw nvidia+pytorch+21.04-py3 bash
Start the container and mount (-m) a local directory to /work inside the container.
$ enroot start -m <localDir>:/work --rw nvidia+pytorch+21.04-py3 jupyter lab
Start the container, mount a directory and start the application JupyterLab (jupyter lab).
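The mount option can also be combined with a non-interactive command, e.g. to run a script from a project directory. A minimal sketch, assuming bash; run_with_workdir is a hypothetical helper that resolves the local directory to an absolute path and fails early if it does not exist. Commands inside the container should refer to files via /work.

```shell
# run_with_workdir LOCALDIR NAME CMD...
# Hypothetical helper: mount LOCALDIR at /work in container NAME (read-write)
# and run CMD; fails early if LOCALDIR does not exist.
run_with_workdir() {
    local dir name="$2"
    # resolve LOCALDIR to an absolute path (cd fails if it is missing)
    dir=$(cd "$1" 2>/dev/null && pwd) || { echo "no such directory: $1" >&2; return 1; }
    shift 2
    enroot start -m "$dir":/work --rw "$name" "$@"
}
```

Usage, e.g. with a hypothetical script train.py: run_with_workdir ./myproject nvidia+pytorch+21.04-py3 python /work/train.py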
Exporting and transferring containers
If you intend to use Docker images that you built elsewhere, e.g. on your local desktop, there are several ways to transfer them:
- Build an image via docker build and a Dockerfile, then import this image from the Docker daemon via
  $ enroot import --output myImage.sqsh dockerd://myImage
  Transfer the .sqsh-file to HoreKa and import it with
- Export an existing enroot container via
  $ enroot export --output myImage.sqsh myImage
  Transfer the .sqsh-file to HoreKa and import it with
- Create a self-extracting bundle from a container image via
  $ enroot bundle --output myImage.run myImage.sqsh
  Transfer the .run-file to HoreKa. You can run the self-extracting image via ./myImage.run even if enroot is not installed!
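The transfer step itself can be done with scp. The following is a sketch, assuming scp access to the login host hk.scc.kit.edu (used in the JupyterLab section below) and that the file should land in your home directory; push_image is a hypothetical helper name. Note that Docker requires lowercase repository names, so a local build would use e.g. myimage rather than myImage.

```shell
# push_image FILE ACCOUNT
# Hypothetical helper: copy a .sqsh or .run file to your HoreKa home directory.
push_image() {
    local file="$1" account="$2"
    [ -f "$file" ] || { echo "missing file: $file" >&2; return 1; }
    scp "$file" "$account@hk.scc.kit.edu:"
}
```

Usage on the desktop, e.g.: docker build -t myimage . && enroot import --output myimage.sqsh dockerd://myimage && push_image myimage.sqsh <yourAccount>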
You can list all your containers with the enroot list command. Additional information is revealed by the
The containers can be removed with the enroot remove command, or by simply deleting
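In scripts it can be handy to remove a container only if it actually exists. A minimal sketch, assuming that plain enroot list prints one container name per line and that enroot remove accepts --force to skip its confirmation prompt; remove_if_present is a hypothetical helper name.

```shell
# remove_if_present NAME
# Hypothetical helper: delete container NAME only if `enroot list` shows it;
# --force skips the confirmation prompt of `enroot remove`.
remove_if_present() {
    if enroot list | grep -Fqx -- "$1"; then
        enroot remove --force "$1"
    else
        echo "container $1 not found"
    fi
}
```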
Enroot is integrated into Slurm, so the use of containers can be controlled via Slurm parameters. The steps otherwise performed manually with Enroot, such as downloading the image, unpacking it and starting the container, are then carried out automatically when calling srun or sbatch. This extension of Slurm is provided by the Pyxis plugin.
"Pyxis is a SPANK plugin for the Slurm Workload Manager. It allows unprivileged cluster users to run containerized tasks through the srun command." 1
- Execute the user's task in an unprivileged container.
- Command-line interface.
- Docker image download with support for layers caching.
- Supports multi-node MPI jobs through PMI2 or PMIx.
- Allows users to install packages inside the container.
- Works with shared filesystems.
- Does not require cluster-wide management of subordinate user/group ids.
For a detailed description of its usage, please refer to the Pyxis page on NVIDIA's GitHub.
Run a command inside of a container
$ srun -p cpuonly --container-image=centos grep PRETTY /etc/os-release
PRETTY_NAME="CentOS Linux 8"
Slurm, in conjunction with the Pyxis plugin, submits a job on the partition cpuonly, automatically downloads the CentOS image from Docker Hub, (temporarily) unpacks the root file system and executes the command grep PRETTY /etc/os-release inside the running container. Non-standard registries can be used via --container-image=<otherRegistry>#<imageName> and even local sqsh-files via
Reuse containers by naming them
First run:
$ srun -p single --container-image=centos --container-name=myCentos grep PRETTY /etc/os-release
By naming the container, the root file system will be unpacked under $ENROOT_DATA_PATH/pyxis_myCentos. For subsequent runs, the following command can be used:
$ srun -p single --container-name=myCentos grep PRETTY /etc/os-release
TensorFlow+GPU example:
$ srun -p accelerated \
    --gres=gpu:1 \
    --container-mounts=/etc/slurm/task_prolog.hk:/etc/slurm/task_prolog.hk,/scratch:/scratch \
    --container-name=nvidia+tensorflow+21.07-tf2-py3 \
    bash -c 'python -c "import tensorflow as tf; tf.config.list_physical_devices()"'
…
Found device 0 with properties:
pciBusID: 0000:31:00.0 name: NVIDIA A100-SXM4-40GB computeCapability: 8.0
…
This command spawns a job with one GPU on the accelerated partition. The --container-mounts option mounts some directories inside the container, which are required for smooth operation with the HoreKa Slurm configuration.
The named container can be prepared either by a first run with --container-image=nvcr.io#nvidia/tensorflow:21.07-tf2-py3 (in addition to --container-name) or by downloading (and optionally modifying) it manually:
$ enroot import docker://nvcr.io#nvidia/tensorflow:21.07-tf2-py3
$ enroot create --name pyxis_nvidia+tensorflow+21.07-tf2-py3 nvidia+tensorflow+21.07-tf2-py3.sqsh
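The two manual steps can be wrapped so that the pyxis_ prefix is never forgotten. A minimal sketch, assuming the naming scheme described above (a container created as pyxis_NAME is picked up by srun --container-name=NAME); prepare_pyxis_container is a hypothetical helper name.

```shell
# prepare_pyxis_container URI SQSH NAME
# Hypothetical helper: import URI into SQSH and unpack it as pyxis_NAME,
# so that `srun --container-name=NAME` can reuse it.
prepare_pyxis_container() {
    local uri="$1" sqsh="$2" name="$3"
    enroot import --output "$sqsh" "$uri" &&
        enroot create --name "pyxis_$name" "$sqsh"
}
```

Usage, e.g.: prepare_pyxis_container docker://nvcr.io#nvidia/tensorflow:21.07-tf2-py3 nvidia+tensorflow+21.07-tf2-py3.sqsh nvidia+tensorflow+21.07-tf2-py3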
Sbatch example: execute an sbatch script inside a container image
$ cat pyxis-sbatch.sh
#!/bin/bash
#SBATCH --container-image nvcr.io\#nvidia/tensorflow:22.02-tf2-py3
#SBATCH --container-mounts=/etc/slurm/task_prolog.hk:/etc/slurm/task_prolog.hk,/scratch:/scratch
#SBATCH -p accelerated
#SBATCH --gres=gpu:1

python -c "import tensorflow as tf;\
print(tf.__version__);\
print(tf.reduce_sum(tf.random.normal([10000, 10000])))"
$ sbatch -o slurm.out pyxis-sbatch.sh
$ cat slurm.out
pyxis: importing docker image ...
…
2.7.0
tf.Tensor(-5098.84, shape=(), dtype=float32)
Please note the escape sequence \ in the nvcr.io\#nvidia/tensorflow:22.02-tf2-py3 string. Without it, Slurm would interpret the separator # between the registry and the image name as the start of a comment.
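Since a missing backslash is easy to overlook, a batch script can be checked before submission. The following is a rough sketch using grep; check_image_escape is a hypothetical helper, and its simple pattern may also flag a legitimate trailing comment on the same directive line.

```shell
# check_image_escape FILE
# Hypothetical lint: report #SBATCH --container-image lines containing a #
# that is not preceded by a backslash; returns 1 if any are found.
check_image_escape() {
    if grep -nE '^#SBATCH.*--container-image.*[^\\]#' "$1"; then
        echo "unescaped # in --container-image (use \\#)" >&2
        return 1
    fi
}
```

Usage, e.g.: check_image_escape pyxis-sbatch.sh && sbatch -o slurm.out pyxis-sbatch.sh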
How can I run JupyterLab in a container and connect to it?
- Start an interactive session with or without GPUs. Note the ID of the compute node the session is running on, and start a container running JupyterLab, e.g.:
enroot start -m <localDir>:/work --rw nvidia+pytorch+21.04-py3 jupyter lab --no-browser --ip 0.0.0.0
- Open a terminal on your desktop and create an SSH tunnel to the running JupyterLab instance on the compute node. Insert the ID of the node on which the interactive session is running:
ssh -L8888:<computeNodeID>:8888 <yourAccount>@hk.scc.kit.edu
- Open a web browser and go to the URL localhost:8888.
- Enter the token, which is visible in the output of the first terminal. The output should look like this:
  Or copy and paste this URL: http://hostname:8888/?token=fdaa7bf344b9ef3c0623c6e…4ce56dd845
  Copy the string after token= and paste it into the input field in the browser.
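Instead of copying the token by hand, it can be cut out of the JupyterLab log line with sed. A minimal sketch, assuming the token consists of letters and digits as in the output above; jupyter_token is a hypothetical helper name.

```shell
# jupyter_token LINE
# Hypothetical helper: print the value following ?token= or &token= in LINE
# (prints nothing if LINE contains no token).
jupyter_token() {
    printf '%s\n' "$1" | sed -n 's/.*[?&]token=\([0-9A-Za-z]*\).*/\1/p'
}
```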
Are GPUs accessible from within a running container?
Unlike Docker, Enroot does not need further command line options to enable GPU passthrough like
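Whether the GPUs of a job are visible can be checked from inside the container. A minimal sketch, assuming a container based on an NGC PyTorch image and a session with GPUs allocated; cuda_visible_in is a hypothetical helper name.

```shell
# cuda_visible_in NAME
# Hypothetical helper: ask PyTorch inside container NAME whether CUDA devices
# are visible (prints True or False).
cuda_visible_in() {
    enroot start "$1" python -c "import torch; print(torch.cuda.is_available())"
}
```

Usage, e.g.: cuda_visible_in nvidia+pytorch+21.04-py3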
Is there something like
As far as we know, no. Enroot is mainly intended for HPC workloads, not for operating multi-container applications. However, starting and running these applications separately is possible.
Can I use workspaces to store containers?
Yes. The locations of configuration files and storage can be defined with environment variables; set ENROOT_DATA_PATH to a directory inside your workspace. Please refer to NVIDIA's documentation on runtime configuration.
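A possible ~/.bashrc setup is sketched below. It assumes the hpc-workspace tools (ws_allocate, ws_find) are available and that a workspace was created beforehand; the helper name use_workspace_for_enroot and the subdirectory layout are hypothetical. Besides ENROOT_DATA_PATH, the sketch also moves ENROOT_CACHE_PATH, which enroot uses for downloaded image layers.

```shell
# use_workspace_for_enroot WSNAME
# Hypothetical helper: point enroot's data and cache paths into the existing
# workspace WSNAME (located via ws_find); returns 1 if it cannot be found.
use_workspace_for_enroot() {
    local ws_dir
    ws_dir=$(ws_find "$1" 2>/dev/null) || return 1
    [ -n "$ws_dir" ] || return 1
    export ENROOT_DATA_PATH="$ws_dir/enroot/data"
    export ENROOT_CACHE_PATH="$ws_dir/enroot/cache"
    mkdir -p "$ENROOT_DATA_PATH" "$ENROOT_CACHE_PATH"
}
```

Usage in ~/.bashrc, e.g.: use_workspace_for_enroot myws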
- Source code: https://github.com/NVIDIA/enroot
- Documentation: https://github.com/NVIDIA/enroot/blob/master/doc
- Additional information: FOSDEM 2020 talk + slides
Singularity
Membership in a special group is required to use Singularity. Please refer to the support channels to request access.
Sylabs, the company behind Singularity, provides excellent documentation on its Documentation & Examples page, so this page confines itself to simple examples that introduce the essential functionality.
Using Singularity usually involves two steps:
- Building a container image using
- Running a container image using
Building an image
singularity build ubuntu.sif library://ubuntu
This pulls the latest Ubuntu image from Singularity's Container Library and locally creates a container image file called
singularity build alpine.sif docker://alpine
This pulls the latest Alpine image from Docker Hub and locally creates a container image file called
singularity build pytorch-21.04-p3.sif docker://nvcr.io#nvidia/pytorch:21.04-py3
This pulls the PyTorch image with the tag 21.04-py3 from NVIDIA's NGC registry and locally creates a container image file called
Running an image
singularity shell ubuntu.sif
Start a shell in the Ubuntu container.
singularity run alpine.sif
Start the container alpine.sif and run the default runscript provided by the image.
singularity exec alpine.sif /bin/ls
Start the container alpine.sif and run the
You can use the singularity search command to search for images on Singularity's Container Library.
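For GPU jobs and data in the current directory, singularity exec is commonly combined with the --nv option (NVIDIA GPU support) and --bind (bind mounts). A minimal sketch; run_in_sif is a hypothetical helper, and binding the current directory to /work is an arbitrary choice mirroring the Enroot examples above.

```shell
# run_in_sif SIF CMD...
# Hypothetical helper: run CMD in image SIF with NVIDIA GPU support (--nv)
# and the current directory bound to /work inside the container.
run_in_sif() {
    local sif="$1"
    shift
    singularity exec --nv --bind "$PWD":/work "$sif" "$@"
}
```

Usage, e.g.: run_in_sif pytorch-21.04-p3.sif python -c "import torch; print(torch.cuda.is_available())"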