Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Tim Schneider

Hi Ole,

thanks for your reply.

The curious thing is that when I run "scontrol reboot nextstate=RESUME 
<nodename>", the drain flag of that node is not set (sinfo shows "mix@" 
and "scontrol show node <nodename>" shows no DRAIN in State, just 
MIXED+REBOOT_REQUESTED), yet no jobs are scheduled on that node until the 
reboot. If I specifically request that node for a job with "-w 
<nodename>", I get "Nodes required for job are DOWN, DRAINED or reserved 
for jobs in higher priority partitions".
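
For reference, the checks above amount to something like the following 
(the node name is a placeholder):

# long-form node state, e.g. State=MIXED+REBOOT_REQUESTED
scontrol show node <nodename> | grep -i state
# long state name plus the reason column in sinfo
sinfo -n <nodename> -o "%N %T %E"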


Not using nextstate=RESUME is inconvenient for me as sometimes we have 
parts of our cluster drained and I would like to run a single command 
that reboots all non-drained nodes once they become idle and all drained 
nodes immediately, resuming them once they are done reinstalling.
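
To illustrate, the single command I have in mind would expand to roughly 
the following (untested sketch; the state selection, reason string and 
node lists are placeholders):

# drained nodes: reboot as soon as they are free, then resume them
drained=$(sinfo --noheader -t drain -o "%N" | sort -u | paste -sd, -)
[ -n "$drained" ] && scontrol reboot ASAP nextstate=RESUME reason="reinstall" "$drained"
# everything else: keep scheduling jobs and reboot whenever a node goes idle
others=$(sinfo --noheader -t idle,mixed,allocated -o "%N" | sort -u | paste -sd, -)
[ -n "$others" ] && scontrol reboot nextstate=RESUME reason="reinstall" "$others"

(The second command is exactly the one where the draining behavior 
described above gets in the way.)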


Best,

Tim

On 25.10.23 14:59, Ole Holm Nielsen wrote:

Hi Tim,

I think the scontrol manual page explains the "scontrol reboot" function
fairly well:


reboot  [ASAP]  [nextstate={RESUME|DOWN}]  [reason=<reason>]
{ALL|<NodeList>}
   Reboot the nodes in the system when they become idle  using  the
   RebootProgram  as  configured  in Slurm's slurm.conf file.  Each
   node will have the "REBOOT" flag added to its node state.  After
   a  node  reboots  and  the  slurmd  daemon  starts up again, the
   HealthCheckProgram will run once. Then, the slurmd  daemon  will
   register  itself with the slurmctld daemon and the "REBOOT" flag
   will be cleared.  The node's "DRAIN" state flag will be  cleared
   if  the reboot was "ASAP", nextstate=resume or down.  The "ASAP"
   option adds the "DRAIN" flag to each  node's  state,  preventing
   additional  jobs  from running on the node so it can be rebooted
   and returned to service  "As  Soon  As  Possible"  (i.e.  ASAP).

It seems to be implicitly understood that if nextstate is specified, this
implies setting the "DRAIN" state flag:


The node's "DRAIN" state flag will be  cleared if the reboot was "ASAP", 
nextstate=resume or down.

You can verify the node's DRAIN flag with "scontrol show node <nodename>".

IMHO, if you want nodes to continue accepting new jobs, then nextstate is
irrelevant.

We always use "reboot ASAP" because our cluster is usually so busy that
nodes never become idle if left to themselves :-)

FYI: We regularly make package updates and firmware updates using the
"scontrol reboot asap" method which is explained in this script:
https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh

Best regards,
Ole


On 10/25/23 13:39, Tim Schneider wrote:

Hi Chris,

thanks a lot for your response.

I just realized that I made a mistake in my post. In the section you cite,
the command is supposed to be "scontrol reboot nextstate=RESUME" (without
ASAP).

So to clarify: my problem is that if I type "scontrol reboot
nextstate=RESUME", no new jobs get scheduled until the reboot. On
the other hand, if I type "scontrol reboot", jobs continue to get
scheduled, which is what I want. I just don't understand why setting
nextstate results in the nodes no longer accepting jobs.

My use case is similar to the one you describe. We use the ASAP option when
we install a new image to ensure that from the point of the reinstallation
onwards, all jobs end up on nodes with the new configuration only.
However, in some cases when we do only minor changes to the image
configuration, we prefer to cause as little disruption as possible and
just reinstall the nodes whenever they are idle. Here, being able to set
nextstate=RESUME is useful, since we usually want the nodes to resume
after reinstallation, no matter what their previous state was.

Hope that clears it up and sorry for the confusion!

Best,

tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
<nodename>", the node goes into the "mix@" state (not drain), but no new
jobs get scheduled until the node reboots. Essentially I get draining
behavior, even though the node's state is not "drain". Note that this
behavior is caused by "nextstate=RESUME"; if I leave it out, jobs get
scheduled as expected. Does anyone have an idea why that could be?

The intent of the "ASAP" flag for "scontrol reboot" is to not let any
more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our
nodes are managed) we would build new images with the latest software
stack, test them on a separate test system and then once happy bring
them over to the production system and do an "scontrol reboot ASAP
nextstate=resume reason=... $NODES" to ensure that from that point
onwards no new jobs would start in the old software configuration, only
the new one.

Also slurmctld would know that these nodes are due to come back in
"ResumeTimeout" seconds after the reboot is issued and so could plan for
them as part of scheduling large jobs, rather than thinking there was no
way it could do so and letting lots of smaller jobs get in the way.

Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Ole Holm Nielsen

Hi Tim,

I think the scontrol manual page explains the "scontrol reboot" function 
fairly well:



   reboot  [ASAP]  [nextstate={RESUME|DOWN}]  [reason=<reason>]
   {ALL|<NodeList>}
  Reboot the nodes in the system when they become idle  using  the
  RebootProgram  as  configured  in Slurm's slurm.conf file.  Each
  node will have the "REBOOT" flag added to its node state.  After
  a  node  reboots  and  the  slurmd  daemon  starts up again, the
  HealthCheckProgram will run once. Then, the slurmd  daemon  will
  register  itself with the slurmctld daemon and the "REBOOT" flag
  will be cleared.  The node's "DRAIN" state flag will be  cleared
  if  the reboot was "ASAP", nextstate=resume or down.  The "ASAP"
  option adds the "DRAIN" flag to each  node's  state,  preventing
  additional  jobs  from running on the node so it can be rebooted
  and returned to service  "As  Soon  As  Possible"  (i.e.  ASAP).


It seems to be implicitly understood that if nextstate is specified, this 
implies setting the "DRAIN" state flag:


The node's "DRAIN" state flag will be  cleared if the reboot was "ASAP", nextstate=resume or down. 


You can verify the node's DRAIN flag with "scontrol show node <nodename>".

IMHO, if you want nodes to continue accepting new jobs, then nextstate is 
irrelevant.


We always use "reboot ASAP" because our cluster is usually so busy that 
nodes never become idle if left to themselves :-)
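
For illustration, such an invocation might look like this (the reason 
string and node list are made up, and nextstate is optional):

scontrol reboot ASAP nextstate=RESUME reason="package and firmware updates" node[001-099]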


FYI: We regularly make package updates and firmware updates using the 
"scontrol reboot asap" method which is explained in this script:

https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh

Best regards,
Ole


On 10/25/23 13:39, Tim Schneider wrote:

Hi Chris,

thanks a lot for your response.

I just realized that I made a mistake in my post. In the section you cite, 
the command is supposed to be "scontrol reboot nextstate=RESUME" (without 
ASAP).


So to clarify: my problem is that if I type "scontrol reboot 
nextstate=RESUME", no new jobs get scheduled until the reboot. On 
the other hand, if I type "scontrol reboot", jobs continue to get 
scheduled, which is what I want. I just don't understand why setting 
nextstate results in the nodes no longer accepting jobs.


My use case is similar to the one you describe. We use the ASAP option when 
we install a new image to ensure that from the point of the reinstallation 
onwards, all jobs end up on nodes with the new configuration only. 
However, in some cases when we do only minor changes to the image 
configuration, we prefer to cause as little disruption as possible and 
just reinstall the nodes whenever they are idle. Here, being able to set 
nextstate=RESUME is useful, since we usually want the nodes to resume 
after reinstallation, no matter what their previous state was.


Hope that clears it up and sorry for the confusion!

Best,

tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
<nodename>", the node goes into the "mix@" state (not drain), but no new
jobs get scheduled until the node reboots. Essentially I get draining
behavior, even though the node's state is not "drain". Note that this
behavior is caused by "nextstate=RESUME"; if I leave it out, jobs get
scheduled as expected. Does anyone have an idea why that could be?

The intent of the "ASAP" flag for "scontrol reboot" is to not let any
more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our
nodes are managed) we would build new images with the latest software
stack, test them on a separate test system and then once happy bring
them over to the production system and do an "scontrol reboot ASAP
nextstate=resume reason=... $NODES" to ensure that from that point
onwards no new jobs would start in the old software configuration, only
the new one.

Also slurmctld would know that these nodes are due to come back in
"ResumeTimeout" seconds after the reboot is issued and so could plan for
them as part of scheduling large jobs, rather than thinking there was no
way it could do so and letting lots of smaller jobs get in the way.




Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-25 Thread Tim Schneider

Hi Chris,

thanks a lot for your response.

I just realized that I made a mistake in my post. In the section you 
cite, the command is supposed to be "scontrol reboot nextstate=RESUME" 
(without ASAP).


So to clarify: my problem is that if I type "scontrol reboot 
nextstate=RESUME", no new jobs get scheduled until the reboot. On 
the other hand, if I type "scontrol reboot", jobs continue to get 
scheduled, which is what I want. I just don't understand why setting 
nextstate results in the nodes no longer accepting jobs.


My use case is similar to the one you describe. We use the ASAP option 
when we install a new image to ensure that from the point of the 
reinstallation onwards, all jobs end up on nodes with the new 
configuration only. However, in some cases when we do only minor changes 
to the image configuration, we prefer to cause as little disruption as 
possible and just reinstall the nodes whenever they are idle. Here, 
being able to set nextstate=RESUME is useful, since we usually want the 
nodes to resume after reinstallation, no matter what their previous 
state was.
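
Concretely, the two variants are roughly (reason strings and node list 
are placeholders):

# major image change: drain via ASAP so no job starts on the old image
scontrol reboot ASAP nextstate=RESUME reason="new image" <nodelist>
# minor change: keep scheduling jobs and reboot whenever a node goes idle
scontrol reboot nextstate=RESUME reason="minor image update" <nodelist>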


Hope that clears it up and sorry for the confusion!

Best,

tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME
<nodename>", the node goes into the "mix@" state (not drain), but no new
jobs get scheduled until the node reboots. Essentially I get draining
behavior, even though the node's state is not "drain". Note that this
behavior is caused by "nextstate=RESUME"; if I leave it out, jobs get
scheduled as expected. Does anyone have an idea why that could be?

The intent of the "ASAP" flag for "scontrol reboot" is to not let any
more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our
nodes are managed) we would build new images with the latest software
stack, test them on a separate test system and then once happy bring
them over to the production system and do an "scontrol reboot ASAP
nextstate=resume reason=... $NODES" to ensure that from that point
onwards no new jobs would start in the old software configuration, only
the new one.

Also slurmctld would know that these nodes are due to come back in
"ResumeTimeout" seconds after the reboot is issued and so could plan for
them as part of scheduling large jobs, rather than thinking there was no
way it could do so and letting lots of smaller jobs get in the way.

Hope that helps!

All the best,
Chris




Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-24 Thread Christopher Samuel

On 10/24/23 12:39, Tim Schneider wrote:

Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME 
<nodename>", the node goes into the "mix@" state (not drain), but no new 
jobs get scheduled until the node reboots. Essentially I get draining 
behavior, even though the node's state is not "drain". Note that this 
behavior is caused by "nextstate=RESUME"; if I leave it out, jobs get 
scheduled as expected. Does anyone have an idea why that could be?


The intent of the "ASAP" flag for "scontrol reboot" is to not let any 
more jobs onto a node until it has rebooted.


IIRC that was from work we sponsored, the idea being that (for how our 
nodes are managed) we would build new images with the latest software 
stack, test them on a separate test system and then once happy bring 
them over to the production system and do an "scontrol reboot ASAP 
nextstate=resume reason=... $NODES" to ensure that from that point 
onwards no new jobs would start in the old software configuration, only 
the new one.


Also slurmctld would know that these nodes are due to come back in 
"ResumeTimeout" seconds after the reboot is issued and so could plan for 
them as part of scheduling large jobs, rather than thinking there was no 
way it could do so and letting lots of smaller jobs get in the way.
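
For context, the two slurm.conf settings involved here are RebootProgram 
and ResumeTimeout; an illustrative excerpt with made-up values:

RebootProgram=/usr/sbin/reboot
ResumeTimeout=600   # seconds slurmctld waits for a rebooted node to register again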


Hope that helps!

All the best,
Chris
--
Chris Samuel  :  http://www.csamuel.org/  :  Berkeley, CA, USA




[slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set

2023-10-24 Thread Tim Schneider

Hi,

from my understanding, if I run "scontrol reboot <nodename>", the node 
should continue to operate as usual and reboot once it is idle. When 
adding the ASAP flag ("scontrol reboot ASAP <nodename>"), the node should 
go into the drain state and not accept any more jobs.


Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME 
<nodename>", the node goes into the "mix@" state (not drain), but no new 
jobs get scheduled until the node reboots. Essentially I get draining 
behavior, even though the node's state is not "drain". Note that this 
behavior is caused by "nextstate=RESUME"; if I leave it out, jobs get 
scheduled as expected. Does anyone have an idea why that could be?


I am running Slurm 22.05.9.

Steps to reproduce:

# To prevent the node from rebooting immediately
sbatch -t 1:00:00 -c 1 --mem-per-cpu 1G -w <nodename> ./long_running_script.sh

# Request the reboot
scontrol reboot nextstate=RESUME <nodename>

# Run an interactive command, which does not start until
# "scontrol cancel_reboot <nodename>" is executed in another shell
srun -t 1:00:00 -c 1 --mem-per-cpu 1G -w <nodename> --pty bash
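
(long_running_script.sh can be any job script that keeps the node busy; a 
minimal stand-in, not the original script, would be:

#!/bin/bash
# keep the allocation busy so the node cannot reboot right away
sleep 3600
)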


Thanks a lot in advance!

Best,

Tim