Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-07 Thread Stefan Staeglich
Hi Xaver,

We also had a similar problem with Slurm 21.08 (see the thread "error:
power_save module disabled, NULL SuspendProgram").

Fortunately, we have not observed it since upgrading to 23.02. But the
time since then (about a month) is still too short to know whether the
problem is really fixed, as we are still within the usual interval between
occurrences of that event.

Best regards,
Stefan

On Wednesday, 6 December 2023, 12:14:46 CET, Xaver Stiensmeier wrote:
> Hi Ole,
> 
> for multiple reasons we build it ourselves. I am not really involved
> in that process, but I will contact the person who is. Thanks for the
> recommendation! We should probably implement a regular check for
> new Slurm versions. I am not 100% sure whether this will fix our
> issues, but it's worth a try.
> 
> Best regards
> Xaver
> 
> On 06.12.23 12:03, Ole Holm Nielsen wrote:
> > On 12/6/23 11:51, Xaver Stiensmeier wrote:
> >> Good idea. Here's our current version:
> >> 
> >> ```
> >> sinfo -V
> >> slurm 22.05.7
> >> ```
> >> 
> >> Quick googling told me that the latest version is 23.11. Does the
> >> upgrade change anything in that regard? I will keep reading.
> > 
> > There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving
> > Power with Slurm" at https://slurm.schedmd.com/publications.html
> > 
> > For reasons of security and functionality it is recommended to follow
> > Slurm's releases (maybe not the first few minor versions of new major
> > releases like 23.11).  FYI, I've collected information about upgrading
> > Slurm in the Wiki page
> > https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm
> > 
> > /Ole


-- 
Albert-Ludwigs-Universität Freiburg
Institut für Informatik
Professur für Maschinelles Lernen

Stefan Stäglich
System-Administrator

T +49 761 203-8223

staeg...@informatik.uni-freiburg.de
https://ml.informatik.uni-freiburg.de

Georges-Köhler-Allee 52
D-79110 Freiburg




Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier

Hi Ole,

for multiple reasons we build it ourselves. I am not really involved
in that process, but I will contact the person who is. Thanks for the
recommendation! We should probably implement a regular check for
new Slurm versions. I am not 100% sure whether this will fix our
issues, but it's worth a try.

Best regards
Xaver

On 06.12.23 12:03, Ole Holm Nielsen wrote:

On 12/6/23 11:51, Xaver Stiensmeier wrote:

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.


There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving
Power with Slurm" at https://slurm.schedmd.com/publications.html

For reasons of security and functionality it is recommended to follow
Slurm's releases (maybe not the first few minor versions of new major
releases like 23.11).  FYI, I've collected information about upgrading
Slurm in the Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm

/Ole
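
The regular version check Xaver mentions could be sketched roughly as follows. This is a minimal sketch, not something from the thread: the helper names `slurm_version` and `version_lt` and the `MIN_VERSION` threshold are hypothetical, it assumes `sinfo -V` prints a line like `slurm 22.05.7` as shown above, and it relies on GNU `sort -V`. It only compares against a locally chosen minimum; it does not query SchedMD for new releases.

```shell
#!/bin/sh
# Sketch only: compare the installed Slurm version against a chosen minimum.
# MIN_VERSION is a site policy choice (assumption), not something Slurm provides.

# Read `sinfo -V` output such as "slurm 22.05.7" on stdin and print the
# version number (second whitespace-separated field).
slurm_version() {
    awk '{print $2}'
}

# Succeed if version $1 is strictly older than version $2.
# Uses `sort -V` (version sort), a GNU coreutils extension.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Intended use, e.g. from a cron job:
#   MIN_VERSION=23.02.0
#   current="$(sinfo -V | slurm_version)"
#   version_lt "$current" "$MIN_VERSION" &&
#       echo "Slurm $current is older than $MIN_VERSION"
```

Keeping the parsing and the comparison in separate functions makes it possible to try the logic on a machine without Slurm installed.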





Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

On 12/6/23 11:51, Xaver Stiensmeier wrote:

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.


There are nice bug fixes in 23.02 mentioned in my SLUG'23 talk "Saving 
Power with Slurm" at https://slurm.schedmd.com/publications.html


For reasons of security and functionality it is recommended to follow 
Slurm's releases (maybe not the first few minor versions of new major 
releases like 23.11).  FYI, I've collected information about upgrading 
Slurm in the Wiki page 
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_installation/#upgrading-slurm


/Ole



Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier

Hi Ole,

Good idea. Here's our current version:

```
sinfo -V
slurm 22.05.7
```

Quick googling told me that the latest version is 23.11. Does the
upgrade change anything in that regard? I will keep reading.

Xaver

On 06.12.23 11:09, Ole Holm Nielsen wrote:

Hi Xaver,

Your version of Slurm may matter for your power saving experience.  Do
you run an updated version?

/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:

Hi Ole,

I will double check, but I am very sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
setting the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give all the information. We have been running our solution
for about a year now, so I don't think there's a general problem (as in
something that necessarily occurs) with the command. But I will take a
closer look. I really feel like it has to be something more conditional,
though, as otherwise the error would've occurred more often (i.e. every
time a failure is handled and the command is executed).








Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

Hi Xaver,

Your version of Slurm may matter for your power saving experience.  Do you 
run an updated version?


/Ole

On 12/6/23 10:54, Xaver Stiensmeier wrote:

Hi Ole,

I will double check, but I am very sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
setting the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give all the information. We have been running our solution
for about a year now, so I don't think there's a general problem (as in
something that necessarily occurs) with the command. But I will take a
closer look. I really feel like it has to be something more conditional,
though, as otherwise the error would've occurred more often (i.e. every
time a failure is handled and the command is executed).

Your repository would've been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have implemented
most things you mention there already. But I will take a look at
`DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find
out; SLURM plans/planned to change that in the future (the cloud key
behaves differently than any other key in PrivateData). Of course our
setup differs a little in the details.

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:

Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:

using https://slurm.schedmd.com/power_save.html we had one case out
of many (>242) node starts that resulted in

|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances
- that are created on demand - are again `~idle` after being removed
by the fail program. They are set to RESUME before the actual
instance gets destroyed. I remember that I had this case manually
before, but I don't remember when it occurs.

Maybe someone has a great idea how to tackle this problem.


Probably you can't assign a "reason" when you update a node with
state=RESUME.  The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN",
"DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving

and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole

--
Ole Holm Nielsen
PhD, Senior HPC Officer
Department of Physics, Technical University of Denmark,
Fysikvej Building 309, DK-2800 Kongens Lyngby, Denmark
E-mail: ole.h.niel...@fysik.dtu.dk
Homepage: http://dcwww.fysik.dtu.dk/~ohnielse/
Mobile: (+45) 5180 1620



Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier

Hi Ole,

I will double check, but I am very sure that giving a reason is possible,
as it has been done at least 20 other times without error during that
exact run. It might be ignored, though. You can also give a reason when
setting the states POWER_UP and POWER_DOWN. Slurm's documentation does
not always give all the information. We have been running our solution
for about a year now, so I don't think there's a general problem (as in
something that necessarily occurs) with the command. But I will take a
closer look. I really feel like it has to be something more conditional,
though, as otherwise the error would've occurred more often (i.e. every
time a failure is handled and the command is executed).

Your repository would've been really helpful for me when we started
implementing the cloud scheduling, but I feel like we have implemented
most things you mention there already. But I will take a look at
`DebugFlags=Power`. `PrivateData=cloud` was an annoying thing to find
out; SLURM plans/planned to change that in the future (the cloud key
behaves differently than any other key in PrivateData). Of course our
setup differs a little in the details.

Best regards
Xaver

On 06.12.23 10:30, Ole Holm Nielsen wrote:

Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:

using https://slurm.schedmd.com/power_save.html we had one case out
of many (>242) node starts that resulted in

|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances
- that are created on demand - are again `~idle` after being removed
by the fail program. They are set to RESUME before the actual
instance gets destroyed. I remember that I had this case manually
before, but I don't remember when it occurs.

Maybe someone has a great idea how to tackle this problem.


Probably you can't assign a "reason" when you update a node with
state=RESUME.  The scontrol manual page says:

Reason= Identify the reason the node is in a "DOWN",
"DRAINED", "DRAINING", "FAILING" or "FAIL" state.

Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving

and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole






Re: [slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Ole Holm Nielsen

Hi Xaver,

On 12/6/23 09:28, Xaver Stiensmeier wrote:
using https://slurm.schedmd.com/power_save.html we had one case out of 
many (>242) node starts that resulted in


|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances - 
that are created on demand - are again `~idle` after being removed by the 
fail program. They are set to RESUME before the actual instance gets 
destroyed. I remember that I had this case manually before, but I don't 
remember when it occurs.


Maybe someone has a great idea how to tackle this problem.


Probably you can't assign a "reason" when you update a node with 
state=RESUME.  The scontrol manual page says:


Reason= Identify the reason the node is in a "DOWN", "DRAINED", 
"DRAINING", "FAILING" or "FAIL" state.


Maybe you will find some useful hints in my Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_cloud_bursting/#configuring-slurm-conf-for-power-saving
and in my power saving tools at
https://github.com/OleHolmNielsen/Slurm_tools/tree/master/power_save

IHTH,
Ole
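
One way to narrow down when the update is rejected would be to record the node state right before and after the call. The following is a minimal sketch under assumptions, not the script discussed in the thread: `node_state` and `resume_node` are hypothetical helper names, the `fail_script` log tag is made up, `$1` is assumed to name a single node, and the `reason=` argument is dropped following Ole's hint.

```shell
#!/bin/sh
# Sketch only: helper functions for a hypothetical fail script.
# Function names and the log tag are illustrative, not part of Slurm.

# Read `scontrol show node <name>` output on stdin and print the value of
# the first field that is named exactly "State".
node_state() {
    tr ' ' '\n' | sed -n 's/^State=//p' | head -n 1
}

# Log the current node state, then request RESUME without a reason.
# If scontrol rejects the update, log the state again and return non-zero.
resume_node() {
    node="$1"
    before="$(scontrol show node "$node" | node_state)"
    logger -t fail_script "node=$node state_before=$before"
    if scontrol update NodeName="$node" state=RESUME; then
        return 0
    fi
    after="$(scontrol show node "$node" | node_state)"
    logger -t fail_script "node=$node RESUME rejected, state_after=$after"
    return 1
}

# Intended use inside the fail script:
#   resume_node "$1"
```

Comparing the logged states of the rare failing call with those of the successful ones may show whether the error depends on the state the node is in at that moment.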




[slurm-users] Power Save: When is RESUME an invalid node state?

2023-12-06 Thread Xaver Stiensmeier

Dear Slurm User list,

using https://slurm.schedmd.com/power_save.html we had one case out of
many (>242) node starts that resulted in

|slurm_update error: Invalid node state specified|

when we called:

|scontrol update NodeName="$1" state=RESUME reason=FailedStartup|

in the Fail script. We run this to make 100% sure that the instances -
which are created on demand - are `~idle` again after being removed by
the fail program. They are set to RESUME before the actual instance gets
destroyed. I remember running into this case before when issuing the
command manually, but I don't remember under what conditions it occurs.

Maybe someone has a great idea how to tackle this problem.

Best regards
Xaver Stiensmeier