What is meant by "silently rebooting" is that the node never became non-responsive; it just rebooted unexpectedly. At a minimum, that means any jobs previously running on the node are gone. The obvious question is whether the node should then be marked DOWN. I can tell you that this has been the behavior since slurm v2.1 was available, and it seems to work fine. I'll try to clarify this in the documentation.
The "etc." means any other reason that a node was put into a DOWN state, which could be any reason a system administrator wanted to specify.

________________________________________
From: [email protected] [[email protected]] On Behalf Of [email protected] [[email protected]]
Sent: Wednesday, June 01, 2011 2:02 PM
To: [email protected]
Subject: [slurm-dev] Question on slurm.conf "ReturnToService" parameter

The 'man' file for 'slurm.conf' shows the following for the "ReturnToService" parameter:

ReturnToService
    Controls when a DOWN node will be returned to service. The default value is 0. Supported values include

    0   A node will remain in the DOWN state until a system administrator explicitly changes its state (even if the slurmd daemon registers and resumes communications).

    1   A DOWN node will become available for use upon registration with a valid configuration only if it was set DOWN due to being non-responsive. If the node was set DOWN for any other reason (low memory, prolog failure, epilog failure, silently rebooting, etc.), its state will not automatically be changed.

    2   A DOWN node will become available for use upon registration with a valid configuration. The node could have been set DOWN for any reason.

My question concerns the "silently rebooting" reason, which is specifically mentioned as a reason that precludes making the node available again when the value of ReturnToService is set to "1". Can someone explain what this means, and why it should be considered different from a node that goes down, is declared "non-responsive", and then comes back up and registers with slurmctld again? I can see the reason for treating "low memory" or "prolog failure" as non-automatic recoveries, but why just rebooting, especially if the node then registers with a valid configuration?

A second question is what other conditions result in this behavior; i.e., what is hiding behind the "etc." in option 1?

-Don Albert-
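
For readers following the thread, the three ReturnToService values quoted from the man page can be read as a small decision rule. The Python below is only an illustrative sketch of that wording, not SLURM's actual implementation: the names `DownReason` and `returns_to_service` are hypothetical, and the list of reasons is limited to the ones mentioned in this thread.

```python
from enum import Enum


class DownReason(Enum):
    """Hypothetical labels for why a node was set DOWN (taken from the thread)."""
    NON_RESPONSIVE = "non-responsive"
    SILENT_REBOOT = "silently rebooting"
    LOW_MEMORY = "low memory"
    PROLOG_FAILURE = "prolog failure"
    EPILOG_FAILURE = "epilog failure"
    ADMIN = "set DOWN by a system administrator"


def returns_to_service(return_to_service: int,
                       reason: DownReason,
                       valid_config: bool = True) -> bool:
    """Return True if a DOWN node would become available again on registration.

    Mirrors the man page text quoted in this thread; it is an assumption-laden
    model, not code from slurmctld.
    """
    if return_to_service not in (0, 1, 2):
        raise ValueError("ReturnToService must be 0, 1, or 2")
    # Value 0: the node stays DOWN until an administrator changes its state.
    if return_to_service == 0:
        return False
    # Values 1 and 2 both require registration with a valid configuration.
    if not valid_config:
        return False
    # Value 1: only a node set DOWN for being non-responsive comes back.
    if return_to_service == 1:
        return reason is DownReason.NON_RESPONSIVE
    # Value 2: the reason the node was set DOWN does not matter.
    return True


if __name__ == "__main__":
    for value in (0, 1, 2):
        for reason in DownReason:
            print(value, reason.value, returns_to_service(value, reason))
```

Under this reading, a node that silently rebooted stays DOWN with ReturnToService=1 but returns to service with ReturnToService=2, which is the distinction the question is about.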
