
Multithreaded + MPI Parallel Programs

Multithreaded + MPI parallel programs run faster than serial programs on multiple CPUs with multiple cores. All threads of one process share resources such as memory. In contrast, MPI tasks do not share memory but can be spawned across different nodes.

OpenMPI with Multithreading

Multiple MPI tasks using OpenMPI must be launched with the MPI launcher mpirun. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

For OpenMPI, a job script job_ompi_omp.sh that submits a batch job running the 76-fold threaded MPI program ompi_omp_program with 4 MPI tasks, requiring 3000 MByte of physical memory per thread (with 10 threads per MPI task this amounts to 10*3000 MByte = 30000 MByte per MPI task) and a total wall clock time of 3 hours, looks like:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=76
#SBATCH --time=03:00:00
#SBATCH --mem=30gb
#SBATCH --export=ALL,MPI_MODULE=mpi/openmpi/4.1,EXECUTABLE=./ompi_omp_program
#SBATCH --output="parprog_hybrid_%j.out"

# Use when a defined module environment related to OpenMPI is wished
module load ${MPI_MODULE}
export MPIRUN_OPTIONS="--bind-to core --map-by socket:PE=${SLURM_CPUS_PER_TASK} -report-bindings"
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export NUM_CORES=$((SLURM_NTASKS * SLURM_CPUS_PER_TASK))
echo "${EXECUTABLE} running on ${NUM_CORES} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpirun -n ${SLURM_NTASKS} ${MPIRUN_OPTIONS} ${EXECUTABLE}"
echo $startexe
exec $startexe

Execute the script job_ompi_omp.sh with the command sbatch:

$ sbatch -p cpuonly job_ompi_omp.sh
  • With the mpirun option --bind-to core, MPI tasks and OpenMP threads are bound to physical cores.
  • With the option --map-by socket:PE=<value>, neighboring MPI tasks are attached to different sockets and each MPI task is bound to the number of CPUs specified in <value>. <value> must be set to ${OMP_NUM_THREADS} (see the expanded command after this list).
  • The option -report-bindings shows the bindings between MPI tasks and physical cores.
  • The mpirun options --bind-to core and --map-by socket|...|node:PE=<value> should always be used when running a multithreaded MPI program.
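
For the values used in the job script above (SLURM_NTASKS=4, SLURM_CPUS_PER_TASK=76), the MPIRUN_OPTIONS expand so that the script effectively executes the following launch line:

$ mpirun -n 4 --bind-to core --map-by socket:PE=76 -report-bindings ./ompi_omp_program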

Intel MPI with Multithreading


Multiple Intel MPI tasks must be launched with the MPI launcher mpiexec.hydra. For multithreaded programs based on Open Multi-Processing (OpenMP), the number of threads is defined by the environment variable OMP_NUM_THREADS. By default this variable is set to 1 (OMP_NUM_THREADS=1).

For Intel MPI, a job script job_impi_omp.sh that submits a batch job running the 76-fold threaded Intel MPI program impi_omp_program with 4 MPI tasks, requiring 64000 MByte of total physical memory per task and a total wall clock time of 1 hour, looks like:

#!/bin/bash
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=76
#SBATCH --time=60
#SBATCH --mem=64000
#SBATCH --export=ALL,MPI_MODULE=mpi/impi,EXE=./impi_omp_program
#SBATCH --output="parprog_impi_omp_%j.out"

# If using more than one MPI task per node, please set
export KMP_AFFINITY=compact,1,0
# export KMP_AFFINITY=verbose,scatter would additionally print messages concerning the supported affinity
#KMP_AFFINITY Description: https://software.intel.com/en-us/node/524790#KMP_AFFINITY_ENVIRONMENT_VARIABLE

# Use when a defined module environment related to Intel MPI is wished
module load ${MPI_MODULE}
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export MPIRUN_OPTIONS="-binding domain=omp:compact -print-rank-map -envall"
export NUM_PROCS=$((SLURM_NTASKS * OMP_NUM_THREADS))
echo "${EXE} running on ${NUM_PROCS} cores with ${SLURM_NTASKS} MPI-tasks and ${OMP_NUM_THREADS} threads"
startexe="mpiexec.hydra -bootstrap slurm ${MPIRUN_OPTIONS} -n ${SLURM_NTASKS} ${EXE}"
echo $startexe
exec $startexe

With the Intel compiler, the environment variable KMP_AFFINITY switches on the binding of threads to specific cores. If you run only one MPI task per node, please set KMP_AFFINITY=compact,1,0.
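
For example, combining the settings mentioned above (the verbose modifier only adds diagnostic output and does not change the placement):

# one MPI task per node: pack threads onto consecutive cores
export KMP_AFFINITY=compact,1,0
# the verbose modifier additionally prints the resulting thread-to-core mapping
#export KMP_AFFINITY=verbose,compact,1,0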

If you want to use 128 or more nodes, you must also set the environment variable as follows:

$ export I_MPI_HYDRA_BRANCH_COUNT=-1

If you want to use the options perhost, ppn or rr, you must additionally set the environment variable:

$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
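
For example, a minimal sketch that combines this setting with the ppn option (here 2 processes per node, reusing ${SLURM_NTASKS} and ${EXE} from the job script above):

$ export I_MPI_JOB_RESPECT_PROCESS_PLACEMENT=off
$ mpiexec.hydra -bootstrap slurm -ppn 2 -n ${SLURM_NTASKS} ${EXE}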

Execute the script job_impi_omp.sh with the command sbatch:

$ sbatch -p cpuonly job_impi_omp.sh

The mpirun option -print-rank-map shows the mapping between MPI tasks and nodes (of limited use). The option -binding binds MPI tasks (processes) to particular processors; domain=omp means that the domain size is determined by the number of threads. In the above example (2 MPI tasks per node) you could also choose -binding "cell=unit;map=bunch"; this binding maps one MPI process to each socket.
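
As a sketch, the launch line from the job script above with this alternative binding (same ${SLURM_NTASKS} and ${EXE}) would read:

$ mpiexec.hydra -bootstrap slurm -binding "cell=unit;map=bunch" -print-rank-map -envall -n ${SLURM_NTASKS} ${EXE}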

New Examples

This is a work-in-progress section.
In the future, all code examples will be available from a GitLab repository and will be tested continuously.

Code

Download: hello-world_mpi+omp.cpp, Makefile

#include <mpi.h>
#include <omp.h>
#include <iostream>
#include <sched.h>
#include <unistd.h> 
#include <hwloc.h>

using namespace std;

int main (int argc, char *argv[])
{
    int size, rank, name_len;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    // Build the hwloc topology so logical CPUs can be mapped to physical cores
    hwloc_topology_t topology;
    hwloc_topology_init(&topology);
    hwloc_topology_load(topology);

    MPI_Init(NULL, NULL);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &name_len);

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nthreads = omp_get_num_threads();
        int logical_core = sched_getcpu();

        // Look up the PU (logical CPU) in the hwloc topology and take its parent core
        hwloc_obj_t obj = hwloc_get_pu_obj_by_os_index(topology, logical_core);
        int physical_core = obj->parent->logical_index;

        // Stagger the output so lines from different ranks/threads interleave less
        usleep(rank*100 + logical_core*1000 + tid*10000);

        #pragma omp critical
        cout<<"node: "<<processor_name
            <<" | MPI-task: "<<rank<<" / "<<size
            <<" | OpenMP-thread: "<<tid <<" / "<<nthreads
            <<" | physical core: " << physical_core
            <<" | logical core: " << logical_core
            <<endl;

    }

    hwloc_topology_destroy(topology);
    MPI_Finalize();
    return 0;
}
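
The submit scripts below build the binaries with make; the actual Makefile is part of the download above. As a rough sketch (compiler wrappers and flags are assumptions, not the verbatim Makefile content), the two targets boil down to compile commands like these:

# Open MPI toolchain (GNU compiler); OpenMP via -fopenmp, hwloc for the core lookup
mpicxx -fopenmp -O2 hello-world_mpi+omp.cpp -lhwloc -o hello-world_mpi+omp_openmpi

# Intel MPI toolchain (Intel LLVM compiler); OpenMP via -qopenmp
mpiicpx -qopenmp -O2 hello-world_mpi+omp.cpp -lhwloc -o hello-world_mpi+omp_intelmpi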

Submit scripts

OpenMPI

Download: batch_one-mpi-task-per-socket.sh

#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=38
#SBATCH --threads-per-core=1
#SBATCH --time=00:02:00
#SBATCH --mem=5000
#SBATCH --output="hello-world_mpi+omp_openmpi_%j.out"

# Load modules
module load compiler/gnu/14
module load mpi/openmpi/5.0

# Compile source code
make hello-world_mpi+omp_openmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores
export OMP_PROC_BIND=TRUE

mpirun --map-by package --bind-to package ./hello-world_mpi+omp_openmpi

Download: batch_one-mpi-task-per-socket_hyperthreading.sh

#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=76
#SBATCH --threads-per-core=2
#SBATCH --time=00:02:00
#SBATCH --mem=5000
#SBATCH --output="hello-world_mpi+omp_openmpi_%j.out"

# Load modules
module load compiler/gnu/14
module load mpi/openmpi/5.0

# Compile source code
make hello-world_mpi+omp_openmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=threads
export OMP_PROC_BIND=TRUE

mpirun --map-by package --bind-to package ./hello-world_mpi+omp_openmpi

Intel MPI

Download: batch_one-mpi-task-per-socket_intel.sh

#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=38
#SBATCH --threads-per-core=1
#SBATCH --time=00:02:00
#SBATCH --mem=5000
#SBATCH --output="hello-world_mpi+omp_intelmpi_%j.out"

# Load modules
module load compiler/intel/2025.1_llvm
module load mpi/impi/2021.11

# Compile source code
make hello-world_mpi+omp_intelmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=cores
export OMP_PROC_BIND=TRUE
export I_MPI_PIN_DOMAIN=socket

mpirun ./hello-world_mpi+omp_intelmpi

Download: batch_one-mpi-task-per-socket_hyperthreading_intel.sh

#!/bin/bash
#SBATCH --partition=dev_cpuonly
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2
#SBATCH --ntasks-per-socket=1
#SBATCH --cpus-per-task=76
#SBATCH --threads-per-core=2
#SBATCH --time=00:02:00
#SBATCH --mem=5000
#SBATCH --output="hello-world_mpi+omp_intelmpi_%j.out"

# Load modules
module load compiler/intel/2025.1_llvm
module load mpi/impi/2021.11

# Compile source code
make hello-world_mpi+omp_intelmpi

export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
export OMP_PLACES=threads
export OMP_PROC_BIND=TRUE
export I_MPI_PIN_DOMAIN=socket

mpirun ./hello-world_mpi+omp_intelmpi
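
As with the examples above, submit these scripts with sbatch; the partition is already selected inside the scripts via #SBATCH --partition, for example:

$ sbatch batch_one-mpi-task-per-socket.sh
$ sbatch batch_one-mpi-task-per-socket_hyperthreading_intel.sh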