Good mechanisms for this include the Prolog, Epilog, and
HealthCheckProgram. A non-zero exit code from the Prolog or Epilog
will set the node's state to Down. The HealthCheckProgram or other
tools would need to use scontrol to set the node state to Down.
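
As a rough illustration of the scontrol approach, here is a minimal
sketch of a script that could be run as the HealthCheckProgram. The
scratch-directory test, the SCRATCH_DIR path, and the helper names are
hypothetical choices for the example, and it assumes the Slurm NodeName
matches the short hostname and that scontrol is on the PATH. Substitute
whatever check actually detects the failure on your nodes.

```python
#!/usr/bin/env python3
"""Hypothetical node health check: mark the node DOWN if scratch is unusable."""
import socket
import subprocess
import sys
import tempfile

# Assumption: jobs need a writable scratch area here; adjust for your site.
SCRATCH_DIR = "/tmp"


def scratch_is_writable(path):
    """Return True if a temporary file can be created and written under path."""
    try:
        with tempfile.NamedTemporaryFile(dir=path) as handle:
            handle.write(b"ok")
            handle.flush()
        return True
    except OSError:
        return False


def down_command(node, reason):
    """Build the scontrol command line that sets a node DOWN with a reason."""
    return [
        "scontrol",
        "update",
        f"NodeName={node}",
        "State=DOWN",
        f"Reason={reason}",
    ]


def main():
    # Assumption: the Slurm NodeName is the short hostname of this machine.
    node = socket.gethostname().split(".")[0]
    if scratch_is_writable(SCRATCH_DIR):
        return 0
    # The check failed, so ask slurmctld to stop scheduling work on this node.
    subprocess.run(
        down_command(node, "health check: scratch not writable"),
        check=False,
    )
    return 1


if __name__ == "__main__":
    sys.exit(main())
```

Something like this would be referenced from the HealthCheckProgram
setting in slurm.conf. Once the underlying problem is fixed, the node
has to be returned to service by hand, for example with
"scontrol update NodeName=<node> State=RESUME".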
Quoting Michael Di Domenico <[email protected]>:
Does slurm have any plugins or abilities for black holing a bad node?
We had a situation recently with a big queue, where a single node that
could accept jobs but could not run jobs drained the queue for a user,
but no work was actually done. A cursory look at the docs and an
internet search didn't turn up anything, not even someone else asking
the same question which seems odd...