Hello,
at least for NVIDIA GPUs, we have Node Health Check (NHC) check the
dcgmi health output - so we have health watches set on the GPUs, and if
dcgmi reports errors, the node gets drained. We're trying to do something
similar for our AMD GPUs but there doesn't seem to be a 'live' health
check like that, so on those we periodically run a diagnostics script &
check the output of that as part of NHC.
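In case it helps, a check along those lines might look like the sketch
below. This is only an illustration, not our actual script - the
function names follow NHC's usual conventions (check functions sourced
into NHC, with die() provided by NHC itself), and the dcgmi flags and
output wording are assumptions you'd want to verify against your DCGM
version:

```shell
# Hypothetical helper: decide pass/fail from captured dcgmi output.
dcgm_output_healthy() {
    # Treat any mention of "error" or "fail" as unhealthy; the exact
    # strings dcgmi prints vary by version, so this pattern is a guess.
    ! printf '%s\n' "$1" | grep -qiE 'error|fail'
}

# NHC-style check wrapping the helper (die() comes from NHC).
check_gpu_dcgm_health() {
    local out
    # `dcgmi health -g 0 -c` checks GPU group 0 against the health
    # watches enabled earlier (e.g. with `dcgmi health -g 0 -s a`).
    out=$(dcgmi health -g 0 -c 2>&1)
    if ! dcgm_output_healthy "$out"; then
        die 1 "check_gpu_dcgm_health: dcgmi reports GPU health errors"
        return 1
    fi
    return 0
}
```

The same shape works for the AMD case - swap the dcgmi call for
whatever your periodic diagnostics script emits and adjust the pattern.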
We've also found failure conditions on some of our GPU nodes that the
dcgmi health watches don't pick up on, and have implemented separate
checks for those (again, added to the NHC script).
My opinion is that it's always better to have the HealthCheckProgram
pick up on errors, rather than rely on 'manual' discovery.
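For reference, wiring NHC in as the HealthCheckProgram is just a couple
of slurm.conf lines - the path and interval below are examples, not our
actual values:

```
# slurm.conf (fragment) - path and interval are illustrative
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=300        # run every 5 minutes
HealthCheckNodeState=ANY       # run on nodes in any state, incl. allocated
```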
We don't do anything about jobs on the nodes - I mean if a GPU dies
mid-job the job(s) using the GPU(s) will likely fail anyway, and the
node goes into drain state, so...
Tina
On 15/03/2026 03:46, Antonio Jose Alonso-Stepanov via slurm-users wrote:
Hi all,
I'm a Stanford CS student looking into how sites handle GPU node
failures during long-running jobs. A couple questions:
When a GPU node goes down mid-job, do most sites use Slurm's requeue or
--no-kill to handle it, or is it mostly manual drain and resubmit?
Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors
via DCGM), or do you handle GPU health monitoring outside of Slurm?
Curious what's worked and what hasn't. Thanks.
Antonio
--
Tina Friedrich, Snr HPC Systems Administrator,
Advanced Research Computing (ARC), The University of Oxford
https://www.arc.ox.ac.uk/
--
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]