Slurm: Job Exit Codes
A job's exit code (also known as exit status, return code and completion code) is captured by SLURM and saved as part of the job record.
Any non-zero exit code will be assumed to be a job failure and will result in a Job State of FAILED with a reason of "NonZeroExitCode".
The exit code is an 8 bit unsigned number ranging between 0 and 255. While it is possible for a job to return a negative exit code, SLURM will display it as an unsigned value in the 0 - 255 range.
Displaying Exit Codes and Signals¶
Slurm displays a job's exit code in the output of the scontrol show job
and the sview utility.
When a signal was responsible for a job or step's termination, the signal number will be displayed after the exit code, delineated by a colon(:).
Submitting Termination Signal¶
Here is an example, how to save a sbatch termination signal in a typical HoreKa-submit script.
[...]
exit_code=$?
echo "### Calling YOUR_PROGRAM command ..."
mpirun -np 'NUMBER_OF_CORES' $YOUR_PROGRAM_BIN_DIR/runproc ... (options) 2>&1
[ "$exit_code" -eq 0 ] && echo "all clean..." || \
echo "Executable ${YOUR_PROGRAM_BIN_DIR}/runproc finished with exit code ${$exit_code}"
[...]
- Do not use
time
mpirun! The exit code will be the one submitted by the first (time) program. - You do not need an
exit $exit_code
in the scripts.