Moe,

Thanks! That sounds like what I want. Do I need to wait for the node to go from Draining to Drained between the two calls, or can I issue them back to back?

I have picked up the newest git code but not installed it yet, so I should be able to try this out once it is installed.

Eric

On 5/14/15 3:10 PM, Moe Jette wrote:

Slurm version 15.08 includes some changes to handle this kind of
problem, but they are not available in earlier versions. With the
new version (not yet released), you will be able to do something like this:

    scontrol update nodename=foo state=drain reason=whatever
    scontrol update nodename=foo state=power_down
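
As a rough sketch only, the two commands above could be wrapped in a small helper that a resume script calls when the cloud allocation fails. The function name `release_failed_node`, the `SCONTROL` override variable, and the default reason string are hypothetical and not part of Slurm. The sketch assumes a Slurm version that accepts `state=power_down` (15.08 or later, per the message above), and it issues the calls back to back; whether a wait for Drained is needed between them is not settled in this thread.

```shell
#!/bin/sh
# Hypothetical helper for a power-save "resume" script failure path.
# SCONTROL can be overridden (e.g. for dry runs); it defaults to scontrol.
SCONTROL="${SCONTROL:-scontrol}"

# release_failed_node NODE [REASON]
# Drains NODE, then asks slurmctld to power it down so it can be
# resumed again later. Assumes state=power_down is supported.
release_failed_node() {
    node="$1"
    reason="${2:-resume_failed}"

    # Step 1: drain the node so no new work is scheduled on it.
    "$SCONTROL" update nodename="$node" state=drain reason="$reason" || return 1

    # Step 2: request power-down; issued immediately after the drain
    # (an assumption, see the note above).
    "$SCONTROL" update nodename="$node" state=power_down
}
```

A resume script might call it as `release_failed_node foo allocation_timeout` after detecting that the allocator failed or timed out.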

Quoting Eric Lund <[email protected]>:
Folks,

I am working on a project that is using Slurm nodes that are
configured for Elastic Computing (Cloud) mode.  For the most part I
have my scripts doing what I want them to do and they work nicely.
The one situation that seems to be troublesome is when an allocation
from my cloud allocator fails or times out while the power save
"resume" script is running.  In that case I get several nodes that are
"not responding" and eventually are declared "down" but never
suspended.  Is there an scontrol state change or something that I can
issue when my resume script fails that will tell slurmctld that the
nodes have not been powered up and it should suspend (or just forget)
them for now and try to resume them again later?

I have tried a few obvious things, but nothing seems to do this simply.

Thanks!

Eric

