I'm just curious as to what causes a user to decide that a given node has
an issue?
If a node is healthy in all respects, why would a user decide not to use
the node?
Typical reasons: not enough free TMPDIR space, a GPU that starts having memory
errors, or a machine with a temporary issue that the Slurm health checks are
not tracking at the time, so it can blackhole jobs.
But honestly, this is less about actual technical problems and more about
keeping users happy as we help port their existing Univa jobs to Slurm.
We have a user with a run script that will add the local
node to the exclude list and requeue itself up to 5 times if it thinks the
program it launched is not running correctly because of a machine issue. I
could emulate this behavior easily if the running job could update its own
ExcNodeList and requeue itself. I can have a job requeue itself (just sleep
after the scontrol command, since the requeue is not instant), but Slurm does
not seem to let me update ExcNodeList on a running job.
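
For reference, here is a rough sketch of the part that does work for me, plus a
helper for building the exclude list. The names MAX_REQUEUES,
add_to_exclude_list and requeue_self are made up for this example; the cap of 5
mirrors the user's script. It assumes SLURM_RESTART_COUNT is set by Slurm on
requeued jobs and unset on the first run. The ExcNodeList update itself is left
out, since that is the step Slurm rejects on a running job.

```shell
#!/bin/bash
# Sketch only: MAX_REQUEUES and both function names are hypothetical.
MAX_REQUEUES=5

# Append a node to a comma-separated exclude list, skipping duplicates.
add_to_exclude_list() {
    local list="$1" node="$2"
    if [ -z "$list" ]; then
        printf '%s\n' "$node"
        return 0
    fi
    case ",$list," in
        *",$node,"*) printf '%s\n' "$list" ;;        # already listed
        *)           printf '%s\n' "$list,$node" ;;  # append
    esac
}

# Requeue the current job unless it has already been restarted too often.
requeue_self() {
    local restarts="${SLURM_RESTART_COUNT:-0}"
    if [ "$restarts" -ge "$MAX_REQUEUES" ]; then
        echo "giving up after $restarts requeues" >&2
        return 1
    fi
    scontrol requeue "$SLURM_JOB_ID"
    # The requeue is not instant, so wait here rather than fall through
    # into the rest of the job script.
    sleep 60
}
```

In the job script this would be called as "requeue_self || exit 1" once the
script decides the node is bad. The 60 second sleep is an arbitrary choice.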
Thanks for your suggestions.