Re: [slurm-users] scontrol reboot does not allow new jobs to be scheduled if nextstate=RESUME is set
Hi Ole,

thanks for your reply. The curious thing is that when I run "scontrol reboot nextstate=RESUME <node>", the drain flag of that node is not set (sinfo shows mix@, and "scontrol show node <node>" shows no DRAIN in State, just MIXED+REBOOT_REQUESTED), yet no jobs are scheduled on that node until it reboots. If I specifically request that node for a job with "-w <node>", I get "Nodes required for job are DOWN, DRAINED or reserved for jobs in higher priority partitions".

Not using nextstate=RESUME is inconvenient for me, as we sometimes have parts of our cluster drained, and I would like to run a single command that reboots all non-drained nodes once they become idle and all drained nodes immediately, resuming them once they are done reinstalling.

Best,
Tim

On 25.10.23 14:59, Ole Holm Nielsen wrote:
Hi Tim,

I think the scontrol manual page explains the "scontrol reboot" function fairly well:

reboot [ASAP] [nextstate={RESUME|DOWN}] [reason=<reason>] {ALL|<NodeList>}

Reboot the nodes in the system when they become idle using the RebootProgram as configured in Slurm's slurm.conf file. Each node will have the "REBOOT" flag added to its node state. After a node reboots and the slurmd daemon starts up again, the HealthCheckProgram will run once. Then, the slurmd daemon will register itself with the slurmctld daemon and the "REBOOT" flag will be cleared. The node's "DRAIN" state flag will be cleared if the reboot was "ASAP", nextstate=resume or down. The "ASAP" option adds the "DRAIN" flag to each node's state, preventing additional jobs from running on the node so it can be rebooted and returned to service "As Soon As Possible" (i.e. ASAP).

It seems to be implicitly understood that if nextstate is specified, this implies setting the "DRAIN" state flag:

The node's "DRAIN" state flag will be cleared if the reboot was "ASAP", nextstate=resume or down.

You can verify the node's DRAIN flag with "scontrol show node <nodename>".

IMHO, if you want nodes to continue accepting new jobs, then nextstate is irrelevant. We always use "reboot ASAP" because our cluster is usually so busy that nodes never become idle if left to themselves :-)

FYI: We regularly make package updates and firmware updates using the "scontrol reboot asap" method, which is explained in this script: https://github.com/OleHolmNielsen/Slurm_tools/blob/master/nodes/update.sh

Best regards,
Ole
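The flag check discussed in this thread ("scontrol show node") could be scripted roughly as follows. This is only a minimal sketch: the helper names node_flags and has_flag are hypothetical, and it assumes that the State= field is a "+"-separated list, as in the MIXED+REBOOT_REQUESTED value reported in this thread.

```shell
#!/bin/sh
# node_flags (hypothetical helper): read "scontrol show node" output on stdin
# and print each component of the first State= field on its own line.
# Assumption: State= looks like BASE+FLAG1+FLAG2 (e.g. MIXED+REBOOT_REQUESTED).
node_flags() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^State=/) {        # exact field name, not e.g. NextState=
                sub(/^State=/, "", $i)
                print $i
                exit
            }
        }
    }' | tr '+' '\n'
}

# has_flag FLAG (hypothetical helper): succeed if FLAG is one of the
# state components found on stdin.
has_flag() {
    node_flags | grep -qx "$1"
}
```

On a live system this might be used as: scontrol show node <nodename> | has_flag DRAIN && echo "DRAIN is set". A node in the situation described in this thread would be expected to pass has_flag REBOOT_REQUESTED but not has_flag DRAIN.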
On 10/25/23 13:39, Tim Schneider wrote:

Hi Chris,

thanks a lot for your response. I just realized that I made a mistake in my post. In the section you cite, the command is supposed to be "scontrol reboot nextstate=RESUME" (without ASAP).

So to clarify: my problem is that if I type "scontrol reboot nextstate=RESUME", no new jobs get scheduled anymore until the reboot. On the other hand, if I type "scontrol reboot", jobs continue to get scheduled, which is what I want. I just don't understand why setting nextstate results in the nodes not accepting jobs anymore.

My use case is similar to the one you describe. We use the ASAP option when we install a new image to ensure that from the point of the reinstallation onwards, all jobs end up on nodes with the new configuration only. However, in some cases, when we make only minor changes to the image configuration, we prefer to cause as little disruption as possible and just reinstall the nodes whenever they are idle. Here, being able to set nextstate=RESUME is useful, since we usually want the nodes to resume after reinstallation, no matter what their previous state was.

Hope that clears it up and sorry for the confusion!

Best,
Tim

On 25.10.23 02:10, Christopher Samuel wrote:

On 10/24/23 12:39, Tim Schneider wrote:

Now my issue is that when I run "scontrol reboot ASAP nextstate=RESUME <node>", the node goes in "mix@" state (not drain), but no new jobs get scheduled until the node reboots. Essentially I get draining behavior, even though the node's state is not "drain". Note that this behavior is caused by "nextstate=RESUME"; if I leave that out, jobs get scheduled as expected. Does anyone have an idea why that could be?

The intent of the "ASAP" flag for "scontrol reboot" is to not let any more jobs onto a node until it has rebooted.

IIRC that was from work we sponsored, the idea being that (for how our nodes are managed) we would build new images with the latest software stack, test them on a separate test system and then, once happy, bring them over to the production system and do an "scontrol reboot ASAP nextstate=resume reason=... $NODES" to ensure that from that point onwards no new jobs would start in the old software configuration, only the new one.

Also, slurmctld would know that these nodes are due to come back in "ResumeTimeout" seconds after the reboot is issued and so could plan for them as part of scheduling large jobs, rather than thinking there was no way it could do so and letting lots of smaller jobs get in the way.

Hope that helps!

All the best,
Chris
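For the use case raised in this thread (reboot drained nodes right away and resume them, reboot the remaining nodes only once they are idle while they keep accepting jobs), one possible workaround is to issue two differently flavoured commands instead of one. The following is an untested dry-run sketch, not a confirmed fix: the function name plan_reboots is hypothetical, the input is assumed to be "<node> <state>" lines such as those printed by sinfo -h -N -o "%N %T", and any state beginning with "drain" is treated as drained. It only prints the commands, it does not run them.

```shell
#!/bin/sh
# plan_reboots (hypothetical helper): read "<node> <state>" lines on stdin and
# print one scontrol command per group. Nothing is executed.
# Assumption: drained/draining nodes have a state that starts with "drain".
# Other states (e.g. down, or a reboot already pending) get no special handling.
plan_reboots() {
    sort -u | awk '
        $2 ~ /^drain/ { drained = drained (drained ? "," : "") $1; next }
                      { other   = other   (other   ? "," : "") $1 }
        END {
            # Drained nodes: reboot as soon as possible, clear DRAIN afterwards.
            if (drained) print "scontrol reboot ASAP nextstate=RESUME " drained
            # All other nodes: plain reboot when idle; per this thread, jobs
            # continue to be scheduled on them until then.
            if (other)   print "scontrol reboot " other
        }'
}
```

A possible invocation is sinfo -h -N -o "%N %T" | plan_reboots, reviewing the printed commands before running them by hand. The split relies on the behavior reported above, namely that a plain "scontrol reboot" leaves nodes schedulable, whereas adding nextstate=RESUME does not.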