Hi Xaver,

The version of Slurm you run may matter for your power-saving experience. Are you running an up-to-date release?

/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:
Hi Ole,

I will double-check, but I am quite sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
defining the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give all the information. We have been running our solution
for about a year now, so I don't think there is a general problem with
the command (as in something that necessarily occurs). But I will take a
closer look. It really feels like something more conditional, though, as
otherwise the error would have occurred more often (i.e. every time a
fail is handled and the command is executed).
Your repository would've been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have already
implemented most of the things you mention there. I will take a look at
`DebugFlags=Power`, though. `PrivateData=cloud` was an annoying thing to
find out; SLURM plans/planned to change that in the future (the cloud key
behaves differently from any other key in PrivateData). Of course, our
setup differs a little in the details.
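For anyone reading along, the two settings mentioned above go into slurm.conf roughly like this (a sketch based on the slurm.conf man page; check your own version's documentation):

```
# slurm.conf (fragment)
# Log power-saving activity verbosely in slurmctld.log
DebugFlags=Power
# Make powered-down cloud nodes visible in sinfo. Note that, unlike the
# other PrivateData keys (which hide information), the cloud key exposes
# information - hence the remark above about it behaving differently.
PrivateData=cloud
```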

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:
Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:
Using https://slurm.schedmd.com/power_save.html, we had one case out
of many (>242) node starts that resulted in

`slurm_update error: Invalid node state specified`

when we called:

`scontrol update NodeName="$1" state=RESUME reason=FailedStartup`

in the Fail script. We run this to make 100% sure that the instances
- which are created on demand - are `~idle` again after being removed
by the fail program. They are set to RESUME before the actual
instance gets destroyed. I remember having hit this case manually
before, but I don't remember when it occurs.

Maybe someone has a great idea how to tackle this problem.

Probably you can't assign a "reason" when you update a node with
state=RESUME.  The scontrol manual page says:

Reason=<reason> Identify the reason the node is in a "DOWN",
"DRAINED", "DRAINING", "FAILING" or "FAIL" state.
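If that is the cause, one possible workaround (an untested sketch, not a verified fix) would be to split the call in two: record the reason while the node is in a state the man page lists as accepting one, then resume in a second step:

```shell
# Sketch of a two-step variant for the Fail script ($1 = node name, as above).
# Reason= is documented only for DOWN/DRAIN-like states, so set it there first:
scontrol update NodeName="$1" State=DOWN Reason=FailedStartup
# ...then resume, so the cloud node returns to ~idle:
scontrol update NodeName="$1" State=RESUME
```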

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving

and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620