Slurm version 15.08 includes changes to support this situation, but they are not available in earlier versions. With the new version (not yet released), you will be able to do something like this:
scontrol update nodename=foo state=drain reason=whatever
scontrol update nodename=foo state=power_down
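
As a rough sketch of how a resume script might issue those two commands when the cloud allocation fails, something like the following could work. The function names, the DRY_RUN switch, and the reason string are hypothetical and only for illustration. It also assumes Slurm 15.08 or later, since power_down is not accepted by earlier versions.

```shell
#!/bin/sh
# Hypothetical helpers for a power-save resume script; names are illustrative.
# Set DRY_RUN=1 to print the scontrol commands instead of running them.

run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Drain the given nodes with a reason, then mark them as powered down.
# $1: node name or hostlist expression, $2: reason string
release_failed_nodes() {
    nodes=$1
    reason=$2
    run_cmd scontrol update nodename="$nodes" state=drain reason="$reason" &&
    run_cmd scontrol update nodename="$nodes" state=power_down
}
```

A resume script could call release_failed_nodes "$1" cloud_alloc_failed from its failure path, assuming the node list arrives as the first argument. Running it first with DRY_RUN=1 shows the commands without touching the controller.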

Quoting Eric Lund <[email protected]>:
Folks,

I am working on a project that is using Slurm nodes that are configured for Elastic Computing (Cloud) mode. For the most part I have my scripts doing what I want them to do and they work nicely. The one situation that seems to be troublesome is when an allocation from my cloud allocator fails or times out while the power save "resume" script is running. In that case I get several nodes that are "not responding" and eventually are declared "down" but never suspended. Is there an scontrol state change or something that I can issue when my resume script fails that will tell slurmctld that the nodes have not been powered up and it should suspend (or just forget) them for now and try to resume them again later?

I have tried a few obvious things, but nothing seems to do this simply.

Thanks!

Eric


--
Morris "Moe" Jette
CTO, SchedMD LLC
Commercial Slurm Development and Support
