Slurm version 15.08 includes some changes to support this type of
problem, but they are not available in earlier versions. With the
new version (not yet released), you will be able to do something
like this:
    scontrol update nodename=foo state=drain reason=whatever
    scontrol update nodename=foo state=power_down
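
As a rough sketch (not tested against a real cluster), the failure path
of a resume script could wrap those two commands in a small shell
function. The function name release_failed_nodes, the SCONTROL override
variable, and the default reason string are made up for illustration;
only the two scontrol commands come from the above, and
state=power_down assumes the 15.08 changes are present.

```shell
#!/bin/sh
# Hypothetical helper for the failure path of a power save "resume" script.
# SCONTROL can be overridden (e.g. for a dry run); it defaults to scontrol.
SCONTROL="${SCONTROL:-scontrol}"

# release_failed_nodes NODES [REASON]
#   NODES  - node name or hostlist expression passed to the resume script
#   REASON - reason recorded on the node (assumed default: resume_failed)
release_failed_nodes() {
    nodes="$1"
    reason="${2:-resume_failed}"

    if [ -z "$nodes" ]; then
        echo "release_failed_nodes: no node names given" >&2
        return 2
    fi

    # Drain first so no new work is scheduled on the nodes.
    "$SCONTROL" update nodename="$nodes" state=drain reason="$reason" || return 1

    # Then tell slurmctld the nodes are powered down.
    "$SCONTROL" update nodename="$nodes" state=power_down
}
```

In a resume script this might be called as
"allocate_cloud_nodes "$1" || release_failed_nodes "$1" alloc_failed",
where allocate_cloud_nodes stands in for whatever your cloud allocator
call is.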
Quoting Eric Lund:
Folks,
I am working on a project that is using Slurm nodes that are
configured for Elastic Computing (Cloud) mode. For the most part I
have my scripts doing what I want them to do and they work nicely.
The one situation that seems to be troublesome is when an allocation
from my cloud allocator fails or times out while the power save
"resume" script is running. In that case I get several nodes that
are "not responding" and eventually are declared "down" but never
suspended. Is there an scontrol state change or something that I
can issue when my resume script fails that will tell slurmctld that
the nodes have not been powered up and it should suspend (or just
forget) them for now and try to resume them again later?
I have tried a few obvious things, but nothing seems to do this simply.
Thanks!
Eric
--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support