Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Xaver,

we also had a similar problem with Slurm 21.08 (see the thread "error:
power_save module disabled, NULL SuspendProgram"). Fortunately, we have
not observed it since the upgrade to 23.02. But the time period (about a
month) is still too short to know whether the problem is really fixed, as
we are still within the normal recurrence period of that event.

Best regards,
Stefan

On Wednesday, 6 December 2023 at 12:14:46 CET, Xaver Stiensmeier wrote:
> Hi Ole,
>
> for multiple reasons we build it ourselves, but I am not really involved
> in that process; I will contact the person who is. Thanks for the
> recommendation! We should probably implement a regular check for new
> Slurm versions. I am not 100% sure whether this will fix our issues, but
> it's worth a try.
>
> Best regards
> Xaver
>
> On 06.12.23 12:03, Ole Holm Nielsen wrote:
>> On 12/6/23 11:51, Xaver Stiensmeier wrote:
>>> Good idea. Here's our current version:
>>>
>>> ```
>>> sinfo -V
>>> slurm 22.05.7
>>> ```
>>>
>>> Quick googling told me that the latest version is 23.11. Does the
>>> upgrade change anything in that regard? I will keep reading.
>>
>> There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving
>> Power with Slurm" at https://slurm.schedmd.com/publications.html
>>
>> For reasons of security and functionality it is recommended to follow
>> Slurm's releases (maybe not the first few minor versions of new major
>> releases like 23.11). FYI, I've collected information about upgrading
>> Slurm in the Wiki page
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
>>
>> /Ole

--
Albert-Ludwigs-Universität Freiburg
Institut für Informatik
Professur für Maschinelles Lernen

Stefan Stäglich, System-Administrator
T +49 761 203-8223
staeg...@informatik.uni-freiburg.de
https://ml.informatik.uni-freiburg.de
Georges-Köhler-Allee 52, D-79110 Freiburg
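[For orientation: the error Stefan cites is slurmctld reporting that it
sees no SuspendProgram, at which point it disables power saving entirely.
A minimal sketch of the slurm.conf power-saving block involved follows;
every path and value below is an illustrative assumption, not either
site's actual configuration:]

```
# slurm.conf -- power-saving sketch (all paths/values are placeholders)
SuspendProgram=/usr/local/sbin/slurm_suspend.sh  # if slurmctld sees this as
ResumeProgram=/usr/local/sbin/slurm_resume.sh    # unset, power_save is disabled
SuspendTime=600        # seconds a node may idle before being powered down
SuspendTimeout=120     # seconds allowed for SuspendProgram to complete
ResumeTimeout=600      # seconds allowed for a node to boot and register
ResumeFailProgram=/usr/local/sbin/slurm_fail.sh  # the "Fail script" discussed below
```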
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole,

for multiple reasons we build it ourselves, but I am not really involved
in that process; I will contact the person who is. Thanks for the
recommendation! We should probably implement a regular check for new Slurm
versions. I am not 100% sure whether this will fix our issues, but it's
worth a try.

Best regards
Xaver

On 06.12.23 12:03, Ole Holm Nielsen wrote:
> On 12/6/23 11:51, Xaver Stiensmeier wrote:
>> Good idea. Here's our current version:
>>
>> ```
>> sinfo -V
>> slurm 22.05.7
>> ```
>>
>> Quick googling told me that the latest version is 23.11. Does the
>> upgrade change anything in that regard? I will keep reading.
>
> There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving
> Power with Slurm" at https://slurm.schedmd.com/publications.html
>
> For reasons of security and functionality it is recommended to follow
> Slurm's releases (maybe not the first few minor versions of new major
> releases like 23.11). FYI, I've collected information about upgrading
> Slurm in the Wiki page
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
>
> /Ole
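[A minimal sketch of the "regular check" idea, runnable from cron; the
expected-version string is an assumption that would have to be maintained
by hand, or replaced by a query against your own package mirror:]

```bash
#!/bin/bash
# Compare the locally installed Slurm version against an expected release.
# EXPECTED is a placeholder to be updated when SchedMD announces a release;
# it is not an authoritative "latest version" lookup.
EXPECTED="23.02.7"

# "sinfo -V" prints e.g. "slurm 22.05.7"; keep only the version field.
INSTALLED=$(sinfo -V | awk '{print $2}')

if [ "$INSTALLED" != "$EXPECTED" ]; then
    # Assumes a local mail(1) command is available for notifications.
    echo "Slurm version drift: installed $INSTALLED, expected $EXPECTED" |
        mail -s "Slurm version check" root
fi
```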
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
On 12/6/23 11:51, Xaver Stiensmeier wrote:
> Good idea. Here's our current version:
>
> ```
> sinfo -V
> slurm 22.05.7
> ```
>
> Quick googling told me that the latest version is 23.11. Does the
> upgrade change anything in that regard? I will keep reading.

There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving
Power with Slurm" at https://slurm.schedmd.com/publications.html

For reasons of security and functionality it is recommended to follow
Slurm's releases (maybe not the first few minor versions of new major
releases like 23.11). FYI, I've collected information about upgrading
Slurm in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

/Ole
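[One detail behind the upgrade advice: the daemons must be upgraded in
order -- slurmdbd first, then slurmctld, then the slurmd on the nodes. A
hedged sketch, assuming an RPM-based install and SchedMD's standard
package names; a site that builds Slurm itself, as Xaver's does, would
substitute its own install step:]

```bash
#!/bin/bash
# Sketch of the usual Slurm upgrade order (package names are assumptions).
# 1. Accounting daemon first.
systemctl stop slurmdbd
dnf upgrade -y slurm-slurmdbd
systemctl start slurmdbd

# 2. Then the controller.
systemctl stop slurmctld
dnf upgrade -y slurm-slurmctld
systemctl start slurmctld

# 3. Then, on each compute node (e.g. via pdsh or Ansible):
#    systemctl stop slurmd && dnf upgrade -y slurm-slurmd && systemctl start slurmd
```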
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole,

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the upgrade
change anything in that regard? I will keep reading.

Xaver

On 06.12.23 11:09, Ole Holm Nielsen wrote:
> Hi Xaver,
>
> Your version of Slurm may matter for your power saving experience. Do
> you run an updated version?
>
> /Ole
>
> On 12/6/23 10:54, Xaver Stiensmeier wrote:
>> Hi Ole,
>>
>> I will double check, but I am very sure that giving a reason is
>> possible, as it has been done at least 20 other times without error
>> during that exact run. It might be ignored, though. You can also give
>> a reason when defining the states POWER_UP and POWER_DOWN. Slurm's
>> documentation does not always give all the information.
>>
>> We have been running our solution for about a year now, so I don't
>> think there's a general problem (as in something that necessarily
>> occurs) with the command. But I will take a closer look. I really feel
>> it has to be something more conditional, as otherwise the error would
>> have occurred more often (i.e. every time a failure is handled and the
>> command is executed).
>
>> IHTH, Ole
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Xaver,

Your version of Slurm may matter for your power saving experience. Do you
run an updated version?

/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:
> Hi Ole,
>
> I will double check, but I am very sure that giving a reason is
> possible, as it has been done at least 20 other times without error
> during that exact run. It might be ignored, though. You can also give a
> reason when defining the states POWER_UP and POWER_DOWN. Slurm's
> documentation does not always give all the information.
>
> We have been running our solution for about a year now, so I don't
> think there's a general problem (as in something that necessarily
> occurs) with the command. But I will take a closer look. I really feel
> it has to be something more conditional, as otherwise the error would
> have occurred more often (i.e. every time a failure is handled and the
> command is executed).
>
> Your repository would've been really helpful for me when we started
> implementing the cloud scheduling, but I feel like we have implemented
> most of the things you mention there already. But I will take a look at
> `DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find
> out; SLURM plans/planned to change that in the future (the cloud key
> behaves differently than any other key in PrivateData). Of course our
> setup differs a little in the details.
>
> Best regards
> Xaver
>
> On 06.12.23 10:30, Ole Holm Nielsen wrote:
>> Hi Xaver,
>>
>> On 12/6/23 09:28, Xaver Stiensmeier wrote:
>>> using https://slurm.schedmd.com/power_save.html we had one case out
>>> of many (>242) node starts that resulted in `slurm_update error:
>>> Invalid node state specified` when we called
>>> `scontrol update NodeName="$1" state=RESUME reason=FailedStartup`
>>> in the Fail script. We run this to make 100% sure that the instances
>>> - which are created on demand - are again `~idle` after being
>>> removed by the fail program. They are set to RESUME before the
>>> actual instance gets destroyed.
>>>
>>> I remember running into this case manually before, but I don't
>>> remember when it occurs. Maybe someone has a great idea how to
>>> tackle this problem.
>>
>> Probably you can't assign a "reason" when you update a node with
>> state=RESUME. The scontrol manual page says:
>>
>>   Reason=
>>     Identify the reason the node is in a "DOWN", "DRAINED",
>>     "DRAINING", "FAILING" or "FAIL" state.
>>
>> Maybe you will find some useful hints in my Wiki page
>> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
>> and in my power saving tools at
>> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
>>
>> IHTH, Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Ole,

I will double check, but I am very sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
defining the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give all the information.

We have been running our solution for about a year now, so I don't think
there's a general problem (as in something that necessarily occurs) with
the command. But I will take a closer look. I really feel it has to be
something more conditional, as otherwise the error would have occurred
more often (i.e. every time a failure is handled and the command is
executed).

Your repository would've been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have implemented
most of the things you mention there already. But I will take a look at
`DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find out;
SLURM plans/planned to change that in the future (the cloud key behaves
differently than any other key in PrivateData). Of course our setup
differs a little in the details.

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:
> Hi Xaver,
>
> On 12/6/23 09:28, Xaver Stiensmeier wrote:
>> using https://slurm.schedmd.com/power_save.html we had one case out of
>> many (>242) node starts that resulted in `slurm_update error: Invalid
>> node state specified` when we called
>> `scontrol update NodeName="$1" state=RESUME reason=FailedStartup`
>> in the Fail script. We run this to make 100% sure that the instances -
>> which are created on demand - are again `~idle` after being removed by
>> the fail program. They are set to RESUME before the actual instance
>> gets destroyed.
>>
>> I remember running into this case manually before, but I don't
>> remember when it occurs. Maybe someone has a great idea how to tackle
>> this problem.
>
> Probably you can't assign a "reason" when you update a node with
> state=RESUME. The scontrol manual page says:
>
>   Reason=
>     Identify the reason the node is in a "DOWN", "DRAINED", "DRAINING",
>     "FAILING" or "FAIL" state.
>
> Maybe you will find some useful hints in my Wiki page
> https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
> and in my power saving tools at
> https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save
>
> IHTH, Ole
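[For readers who want to try the two settings mentioned above, a minimal
sketch of the corresponding slurm.conf lines:]

```
# slurm.conf -- debugging/visibility settings mentioned above
DebugFlags=Power     # verbose power-save (suspend/resume) logging in slurmctld
PrivateData=cloud    # makes powered-down cloud nodes visible in sinfo output
```

[Note the inversion Xaver alludes to: every other PrivateData key
*restricts* what users may see, while the cloud key *grants* visibility of
powered-down cloud nodes.]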
Re: [slurm-users] Power Save: When is RESUME an invalid node state?
Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:
> using https://slurm.schedmd.com/power_save.html we had one case out of
> many (>242) node starts that resulted in `slurm_update error: Invalid
> node state specified` when we called
> `scontrol update NodeName="$1" state=RESUME reason=FailedStartup`
> in the Fail script. We run this to make 100% sure that the instances -
> which are created on demand - are again `~idle` after being removed by
> the fail program. They are set to RESUME before the actual instance gets
> destroyed.
>
> I remember running into this case manually before, but I don't remember
> when it occurs. Maybe someone has a great idea how to tackle this
> problem.

Probably you can't assign a "reason" when you update a node with
state=RESUME. The scontrol manual page says:

  Reason=
    Identify the reason the node is in a "DOWN", "DRAINED", "DRAINING",
    "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH, Ole
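[One way to act on Ole's reading of the manual page is to split the update
in two, attaching the reason only to a state that documents one. A sketch,
assuming the node name arrives as the script's first argument; the
DOWN-then-RESUME sequence is a suggestion, not what either poster actually
runs:]

```bash
#!/bin/bash
# Sketch: record a failure reason, then return the node to service.
node="$1"

# Optional: log the state that triggered the failure, to help answer
# "I don't remember when it occurs" (log path is a placeholder).
sinfo -h -n "$node" -o "%n %t" >> /var/log/slurm_fail_states.log

# Attach the reason while the node is in a state that documents one.
scontrol update NodeName="$node" state=DOWN reason="FailedStartup"

# Then clear the state; RESUME needs no (and may reject a) reason.
scontrol update NodeName="$node" state=RESUME
```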
[slurm-users] Power Save: When is RESUME an invalid node state?
Dear Slurm User list,

using https://slurm.schedmd.com/power_save.html we had one case out of
many (>242) node starts that resulted in

  slurm_update error: Invalid node state specified

when we called

  scontrol update NodeName="$1" state=RESUME reason=FailedStartup

in the Fail script. We run this to make 100% sure that the instances -
which are created on demand - are again `~idle` after being removed by the
fail program. They are set to RESUME before the actual instance gets
destroyed.

I remember running into this case manually before, but I don't remember
when it occurs. Maybe someone has a great idea how to tackle this problem.

Best regards
Xaver Stiensmeier
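[For context, a minimal sketch of the kind of Fail script being described;
only the scontrol line comes from the post itself, and the teardown helper
is a hypothetical placeholder for whatever the cloud layer provides:]

```bash
#!/bin/bash
# Fail script sketch: invoked when an on-demand node fails to start.
node="$1"

# Return the node to power-saved idle (~idle) so it can be scheduled
# again; per the post, RESUME is issued before the instance is destroyed.
scontrol update NodeName="$node" state=RESUME reason=FailedStartup

# Hypothetical helper: tear down the cloud instance backing this node.
# The actual command depends entirely on the cloud provider in use.
/usr/local/sbin/destroy_instance.sh "$node"
```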