Gaudi 2/3
The official documentation for Intel Gaudi can be found at https://docs.habana.ai/en/latest/index.html
Hardware Overview
The system management interface tool hl-smi aids in the management and monitoring of the Gaudi accelerators. Running hl-smi without any options displays a summary table of the detected Gaudi devices:
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.19.1-fw-57.2.2.0 |
| Driver Version: 1.19.1-6f47ddd |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-225 N/A | 0000:19:00.0 N/A | 0 |
| N/A 27C N/A 82W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 26C N/A 73W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 2 HL-225 N/A | 0000:1a:00.0 N/A | 0 |
| N/A 28C N/A 75W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 3 HL-225 N/A | 0000:b4:00.0 N/A | 0 |
| N/A 28C N/A 79W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 4 HL-225 N/A | 0000:43:00.0 N/A | 0 |
| N/A 28C N/A 81W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 5 HL-225 N/A | 0000:cc:00.0 N/A | 0 |
| N/A 28C N/A 73W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 6 HL-225 N/A | 0000:44:00.0 N/A | 0 |
| N/A 27C N/A 82W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 7 HL-225 N/A | 0000:cd:00.0 N/A | 0 |
| N/A 28C N/A 77W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 N/A N/A N/A N/A |
| 1 N/A N/A N/A N/A |
| 2 N/A N/A N/A N/A |
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 N/A N/A N/A N/A |
| 6 N/A N/A N/A N/A |
| 7 N/A N/A N/A N/A |
+=============================================================================+
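When the summary table needs to be processed by a script, its device rows can be picked apart with two regular expressions. The following is a minimal sketch, not an official interface: the names `parse_summary`, `Device`, `ID_ROW`, and `STAT_ROW` are made up for illustration, and the patterns assume the two-row-per-device layout shown above, which may change between hl-smi versions.

```python
import re
from dataclasses import dataclass

# Two devices copied from the summary table above (layout is an assumption
# and may differ between hl-smi versions).
SAMPLE = """\
| 0 HL-225 N/A | 0000:19:00.0 N/A | 0 |
| N/A 27C N/A 82W / 600W | 768MiB / 98304MiB | 0% N/A |
|-------------------------------+----------------------+----------------------+
| 1 HL-225 N/A | 0000:b3:00.0 N/A | 0 |
| N/A 26C N/A 73W / 600W | 768MiB / 98304MiB | 0% N/A |
"""

# First row of a device: index, name, persistence mode, bus id.
ID_ROW = re.compile(r"^\|\s*(\d+)\s+(HL-\S+)\s+\S+\s*\|\s*([0-9a-fA-F:.]+)\s")
# Second row of a device: temperature, power draw/cap, memory used/total, utilization.
STAT_ROW = re.compile(
    r"(\d+)C\s+\S+\s+(\d+)W\s*/\s*(\d+)W\s*\|\s*(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%"
)


@dataclass
class Device:
    index: int
    name: str
    bus_id: str
    temp_c: int
    power_w: int
    power_cap_w: int
    mem_used_mib: int
    mem_total_mib: int
    util_pct: int


def parse_summary(text):
    """Return one Device per pair of device rows found in hl-smi summary text."""
    devices = []
    pending = None  # (index, name, bus_id) waiting for its statistics row
    for line in text.splitlines():
        m = ID_ROW.match(line)
        if m:
            pending = (int(m.group(1)), m.group(2), m.group(3))
            continue
        s = STAT_ROW.search(line)
        if s and pending is not None:
            devices.append(Device(*pending, *(int(v) for v in s.groups())))
            pending = None
    return devices


if __name__ == "__main__":
    for dev in parse_summary(SAMPLE):
        print(dev.index, dev.name, dev.bus_id, f"{dev.power_w}W", f"{dev.mem_used_mib}MiB")
```

To apply the sketch to live output, the text could be obtained with `subprocess.run(["hl-smi"], capture_output=True, text=True).stdout` on a node where hl-smi is installed.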
Using Gaudi Accelerators
Intel provides a custom PyTorch environment that is optimized for Intel Gaudi AI accelerators. The necessary software is preinstalled on the nodes and can be loaded using the Lmod module system. To avoid ambiguities, load architecture-specific modules on the compute node itself, inside an allocation created with salloc for the desired partition (gaudi2 or gaudi3, use one of the following):
salloc -p gaudi2 -t 01:00:00
salloc -p gaudi3 -t 01:00:00
We can then load the PyTorch module:
module purge
module load toolkit/gaudi-torch
Afterwards, a Python program can be started via:
python torch_example.py
Code examples and more information can be found here.
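The contents of torch_example.py are not given here, so the following is only a hypothetical sketch of what such a script might look like. It assumes that the loaded module provides both `torch` and the Intel Gaudi PyTorch bridge package `habana_frameworks`, and it falls back to the CPU when the bridge cannot be imported. The helper name `pick_device_name` is invented for this example.

```python
import importlib.util


def pick_device_name():
    """Return "hpu" when PyTorch and the Gaudi bridge are importable, else "cpu"."""
    has_torch = importlib.util.find_spec("torch") is not None
    has_bridge = importlib.util.find_spec("habana_frameworks") is not None
    return "hpu" if (has_torch and has_bridge) else "cpu"


def main():
    if importlib.util.find_spec("torch") is None:
        print("PyTorch is not importable; load the module first.")
        return

    import torch

    device_name = pick_device_name()
    if device_name == "hpu":
        # Assumption: importing the bridge makes the "hpu" device available.
        import habana_frameworks.torch.core  # noqa: F401

    device = torch.device(device_name)
    a = torch.rand(1024, 1024, device=device)
    b = torch.rand(1024, 1024, device=device)
    c = a @ b
    # Copying the result to the host forces the computation to finish.
    print(device_name, c.sum().cpu().item())


if __name__ == "__main__":
    main()
```

Running the script outside the Gaudi environment is a quick way to check that the fallback path works before requesting an allocation.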
Software for Gaudi
In our experience, the least error-prone way to get a functional software stack for Gaudi is to use containers, which Intel provides directly. The following small working example imports a Docker image from Intel, launches it with the enroot runtime, and runs the hello world example script from Intel's Model-References repository.
# Log in to FTP
$ ssh <user>@ftp-x86-login.scc.kit.edu
# Allocate Gaudi 3 Node
$ salloc -p gaudi3 -t 01:00:00
# Get Docker container
$ enroot import docker://vault.habana.ai#gaudi-docker/1.21.1/ubuntu24.04/habanalabs/pytorch-installer-2.6.0:latest
$ enroot create gaudi-docker+1.21.1+ubuntu24.04+habanalabs+pytorch-installer-2.6.0+latest.sqsh
# Download sample code
$ mkdir -p gaudi && cd gaudi
$ git clone -b 1.21.0 https://github.com/HabanaAI/Model-References.git
$ cd Model-References/PyTorch/examples/computer_vision/hello_world/
# Start container
$ enroot start --rw -m .:/mnt gaudi-docker+1.21.1+ubuntu24.04+habanalabs+pytorch-installer-2.6.0+latest bash
$ cd /mnt
# Run sample case
$ mpirun -n 8 --bind-to core --map-by socket:PE=6 \
--rank-by core --report-bindings \
--allow-run-as-root \
-x PT_HPU_LAZY_MODE=0 python -W ignore mnist.py \
--batch-size=64 --epochs=1 \
--lr=1.0 --gamma=0.7 \
--hpu --autocast --use-torch-compile
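The long file and container names in the enroot commands follow from the image URI. The sketch below reproduces the mapping that can be read off the import and create steps above: the scheme and the registry part before `#` are dropped, and `/` and `:` become `+`. This rule is inferred from the example, not taken from the enroot documentation, and the function name `enroot_names` is hypothetical. It also assumes that `enroot create` without an explicit name uses the file name minus `.sqsh` as the container name.

```python
import re


def enroot_names(uri):
    """Return (squashfs file name, container name) inferred from an image URI.

    Inferred rule, not an official specification: drop "docker://" and the
    registry before "#", then replace "/" and ":" with "+".
    """
    image = uri.split("://", 1)[-1]  # drop the "docker://" scheme
    image = image.split("#", 1)[-1]  # drop the registry part before "#"
    stem = re.sub(r"[/:]", "+", image)
    return stem + ".sqsh", stem


if __name__ == "__main__":
    uri = (
        "docker://vault.habana.ai#gaudi-docker/1.21.1/ubuntu24.04/"
        "habanalabs/pytorch-installer-2.6.0:latest"
    )
    sqsh, container = enroot_names(uri)
    print(sqsh)
    print(container)
```

The container name to pass to `enroot start` can always be confirmed with `enroot list`.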
# In a second terminal, log in to FTP
$ ssh <user>@ftp-x86-login.scc.kit.edu
# Log in to the active compute node
$ srun --jobid $(squeue -u $USER --format=%i -h | head -n 1) --pty bash
# Monitor the activity
$ hl-smi -l 1
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.21.1-fw-59.2.3.0 |
| Driver Version: 1.21.0-ca59b5a |
| Nic Driver Version: 1.21.0-732bcf3 |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-325L N/A | 0000:17:00.0 N/A | 0 |
| N/A 45C P0 222W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 1 HL-325L N/A | 0000:97:00.0 N/A | 0 |
| N/A 44C P0 223W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 2 HL-325L N/A | 0000:2c:00.0 N/A | 0 |
| N/A 39C P0 219W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 3 HL-325L N/A | 0000:ba:00.0 N/A | 0 |
| N/A 39C P0 220W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 4 HL-325L N/A | 0000:3d:00.0 N/A | 0 |
| N/A 41C P0 218W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 5 HL-325L N/A | 0000:a9:00.0 N/A | 0 |
| N/A 43C P0 222W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 6 HL-325L N/A | 0000:4e:00.0 N/A | 0 |
| N/A 45C P0 217W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 7 HL-325L N/A | 0000:cb:00.0 N/A | 0 |
| N/A 43C P0 224W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 1661819 C python 130400MiB
| 1 1661823 C python 130400MiB
| 2 1661820 C python 130400MiB
| 3 1661825 C python 130400MiB
| 4 1661821 C python 130400MiB
| 5 1661824 C python 130400MiB
| 6 1661822 C python 130400MiB
| 7 1661826 C python 130400MiB
+=============================================================================+
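A quick way to confirm that every rank of the MPI run landed on its own accelerator is to read the Compute Processes section of this output. The following sketch assumes the row layout shown above (AIP index, PID, type, process name without spaces, memory in MiB). The names `processes_by_device` and `idle_devices` are invented for illustration.

```python
import re
from collections import defaultdict

# Three process rows copied from the output above.
SAMPLE = """\
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 1661819 C python 130400MiB
| 1 1661823 C python 130400MiB
| 2 1661820 C python 130400MiB
"""

# AIP index, PID, type, process name, memory usage in MiB.
# Idle rows ("N/A" in the PID column) do not match on purpose.
PROC_ROW = re.compile(r"^\|\s*(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(\d+)MiB")


def processes_by_device(text):
    """Map each AIP index to a list of (pid, process name, memory in MiB)."""
    procs = defaultdict(list)
    for line in text.splitlines():
        m = PROC_ROW.match(line)
        if m:
            aip, pid, _type, name, mem = m.groups()
            procs[int(aip)].append((int(pid), name, int(mem)))
    return dict(procs)


def idle_devices(text, device_count):
    """Return the AIP indices in range(device_count) that have no process."""
    busy = processes_by_device(text)
    return [i for i in range(device_count) if i not in busy]


if __name__ == "__main__":
    print(processes_by_device(SAMPLE))
    print("idle:", idle_devices(SAMPLE, 4))
```

For the eight-rank run above, an empty result from `idle_devices(text, 8)` would indicate that all eight devices are in use.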