Hi all,

I'm a Stanford CS student looking into how sites handle GPU node failures 
during long-running jobs. A couple of questions:

When a GPU node goes down mid-job, do most sites use Slurm's requeue or 
--no-kill to handle it, or is it mostly manual drain and resubmit?
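For context, here's a minimal sketch of the knobs I'm asking about (values and names like train.sh and gpu-node01 are placeholders, not recommendations):

```
# slurm.conf: allow jobs to be requeued after a node failure
JobRequeue=1

# sbatch: opt a job into automatic requeue on failure...
sbatch --requeue train.sh

# ...or let it keep running on surviving nodes when one node dies
sbatch --no-kill train.sh

# The manual path: drain the bad node, then resubmit by hand
scontrol update NodeName=gpu-node01 State=DRAIN Reason="GPU failure"
```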

Is anyone using HealthCheckProgram to catch GPU issues (like ECC errors via 
DCGM), or do you handle GPU health monitoring outside of Slurm?
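To make that concrete, the kind of thing I have in mind is a HealthCheckProgram along these lines. This is just a sketch under my own assumptions: the zero-error threshold, the drain reason, and using nvidia-smi's ECC query instead of dcgmi are all illustrative choices, not something I've seen deployed.

```
#!/bin/sh
# Hypothetical Slurm HealthCheckProgram: drain this node if any GPU
# reports uncorrected volatile ECC errors. Threshold is illustrative.

THRESHOLD=0

# Query per-GPU uncorrected ECC counts, one integer per line.
# If the query itself fails (no driver, no GPUs), don't drain the node.
counts=$(nvidia-smi --query-gpu=ecc.errors.uncorrected.volatile.total \
                    --format=csv,noheader 2>/dev/null) || exit 0

for c in $counts; do
    if [ "$c" -gt "$THRESHOLD" ]; then
        scontrol update NodeName="$(hostname -s)" State=DRAIN \
                 Reason="GPU uncorrected ECC errors"
        exit 1
    fi
done
exit 0
```

The script would be pointed to by HealthCheckProgram in slurm.conf and run on the compute nodes at each HealthCheckInterval.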

Curious what's worked and what hasn't. Thanks.

Antonio
-- 
slurm-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
