Hello,
at least for NVIDIA GPUs, we have Node Health Check (NHC) check the
dcgmi health output - so we have health watches set on the GPUs, and if
dcgmi reports errors, the node gets drained. We're trying to do something
similar for our AMD GPUs but there doesn't seem to be a 'live' health
check like that, so on those we periodically run a diagnostics script &
check the output of that as part of NHC.
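In case it helps, a check along those lines might look like the sketch
below. This is only an illustration, not our actual script - the
function names follow NHC's usual conventions (check functions sourced
into NHC, with die() provided by NHC itself), and the dcgmi flags and
output wording are assumptions you'd want to verify against your DCGM
version:

```shell
# Hypothetical helper: decide pass/fail from captured dcgmi output.
dcgm_output_healthy() {
    # Treat any mention of "error" or "fail" as unhealthy; the exact
    # strings dcgmi prints vary by version, so this pattern is a guess.
    ! printf '%s\n' "$1" | grep -qiE 'error|fail'
}

# NHC-style check wrapping the helper (die() comes from NHC).
check_gpu_dcgm_health() {
    local out
    # `dcgmi health -g 0 -c` checks GPU group 0 against the health
    # watches enabled earlier (e.g. with `dcgmi health -g 0 -s a`).
    out=$(dcgmi health -g 0 -c 2>&1)
    if ! dcgm_output_healthy "$out"; then
        die 1 "check_gpu_dcgm_health: dcgmi reports GPU health errors"
        return 1
    fi
    return 0
}
```

The same shape works for the AMD case - swap the dcgmi call for
whatever your periodic diagnostics script emits and adjust the pattern.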
We've also found failure conditions on some of our GPU nodes that the
dcgmi health watches don't pick up on, and have implemented separate
checks for those (again, added to the NHC script).
My opinion is that it's always better to have the HealthCheckProgram
pick up on errors, rather than rely on 'manual' discovery.
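For reference, wiring NHC in as the HealthCheckProgram is just a couple
of slurm.conf lines - the path and interval below are examples, not our
actual values:

```
# slurm.conf (fragment) - path and interval are illustrative
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300        # run every 5 minutes
HealthCheckNodeState=ANY       # run on nodes in any state, incl. allocated
```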
We don't do anything about jobs on the nodes - I mean if a GPU dies
mid-job the job(s) using the GPU(s) will likely fail anyway, and the
node goes into drain state, so...
Tina
On 15/03/2026 03:46, Antonio Jose Alonso-Stepanov via slurm-users wrote:
Hi all,
I'm a Stanford CS student looking into how sites handle GPU node
failures during long-running jobs. A couple questions:
When a GPU node goes down mid-job, do most sites use Slurm's requeue or
--no-kill to handle it, or is it mostly manual drain and resubmit?
Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors
via DCGM), or do you handle GPU health monitoring outside of Slurm?
Curious what's worked and what hasn't. Thanks.
Antonio
--
Tina Friedrich, Snr HPC Systems Administrator,
Advanced Research Computing (ARC), The University of Oxford
https://www.arc.ox.ac.uk/
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]