Folks,

I am working on a project that is using Slurm nodes that are configured for Elastic Computing (Cloud) mode. For the most part I have my scripts doing what I want them to do and they work nicely. The one situation that seems to be troublesome is when an allocation from my cloud allocator fails or times out while the power save "resume" script is running. In that case I get several nodes that are "not responding" and eventually are declared "down" but never suspended. Is there an scontrol state change or something that I can issue when my resume script fails that will tell slurmctld that the nodes have not been powered up and it should suspend (or just forget) them for now and try to resume them again later?

I have tried a few obvious things, but nothing seems to do this simply.

Thanks!

Eric

Reply via email to