Good mechanisms for this include the Prolog, Epilog, and HealthCheckProgram. A non-zero exit code from the Prolog or Epilog will set the node's state to Down. The HealthCheckProgram, or any other tool, would need to call scontrol itself to set the node's state to Down.
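
As a rough illustration of the HealthCheckProgram approach, here is a minimal Python sketch. The particular check (whether a scratch directory is writable), the /tmp default, and the function names are all hypothetical and chosen only for the example; it also assumes the Slurm NodeName matches the short hostname. The only Slurm interface it relies on is "scontrol update NodeName=... State=... Reason=...".

```python
import socket
import subprocess
import tempfile


def scratch_is_writable(path):
    """Return True if a temporary file can be created and written under path."""
    try:
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"ok")
            handle.flush()
        return True
    except OSError:
        # Covers a missing directory, permission errors, and a full or read-only disk.
        return False


def build_scontrol_command(node, reason, state="DOWN"):
    """Build the scontrol argument list that takes a node out of service."""
    return [
        "scontrol",
        "update",
        f"NodeName={node}",
        f"State={state}",
        f"Reason={reason}",
    ]


def main(scratch_dir="/tmp"):
    """Run the check; on failure, ask Slurm to mark this node Down."""
    if scratch_is_writable(scratch_dir):
        return 0
    # Assumption: the NodeName in slurm.conf is the short hostname.
    node = socket.gethostname().split(".")[0]
    command = build_scontrol_command(node, f"{scratch_dir} not writable")
    subprocess.run(command, check=False)
    return 1

# When installed as the HealthCheckProgram, end the script with:
#     raise SystemExit(main())
```

Passing state="DRAIN" instead of the default would let running jobs finish while keeping new ones off the node, which may be preferable if the fault does not affect work already in progress.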

Quoting Michael Di Domenico <[email protected]>:

Does Slurm have any plugins or abilities for black-holing a bad node?
We had a situation recently with a big queue, where a single node that
could accept jobs but could not run jobs drained the queue for a user,
but no work was actually done.  A cursory look at the docs and an
internet search didn't turn up anything, not even someone else asking
the same question, which seems odd...



