The man page for 'slurm.conf' shows the following for the
"ReturnToService" parameter:

   ReturnToService
      Controls when a DOWN node will be returned to service.  The
      default value is 0.  Supported values include

      0   A node will remain in the DOWN state until a system
          administrator explicitly changes its state (even if the
          slurmd daemon registers and resumes communications).
      1   A DOWN node will become available for use upon registration
          with a valid configuration only if it was set DOWN due to
          being non-responsive.  If the node was set DOWN for any
          other reason (low memory, prolog failure, epilog failure,
          silently rebooting, etc.), its state will not automatically
          be changed.
      2   A DOWN node will become available for use upon registration
          with a valid configuration.  The node could have been set
          DOWN for any reason.
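
To make sure I am reading the three settings correctly, here is a minimal
Python sketch of the rule as the man page text describes it.  It is only my
model of the documented behavior, not Slurm source code; the function name,
argument names, and the reason strings are all made up for illustration.

```python
# Hypothetical model of the ReturnToService rule quoted from the man page.
# Nothing here is taken from the Slurm sources; all names are illustrative.

def node_returns_to_service(return_to_service: int,
                            down_reason: str,
                            valid_config: bool) -> bool:
    """Return True if a DOWN node becomes available when slurmd registers."""
    if return_to_service not in (0, 1, 2):
        raise ValueError("ReturnToService must be 0, 1, or 2")
    if return_to_service == 0:
        # Only an administrator can change the node's state.
        return False
    if not valid_config:
        # Values 1 and 2 both require registration with a valid configuration.
        return False
    if return_to_service == 1:
        # Only nodes set DOWN for being non-responsive come back on their own.
        return down_reason == "non-responsive"
    # return_to_service == 2: the reason the node was set DOWN does not matter.
    return True


if __name__ == "__main__":
    for reason in ("non-responsive", "silently rebooting", "prolog failure"):
        print(reason, node_returns_to_service(1, reason, valid_config=True))
```

Under this reading, with a setting of 1, a node marked "silently rebooting"
stays DOWN even though it registers with a valid configuration, which is the
case I am asking about.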

My question concerns the "silently rebooting" reason, which is specifically
mentioned as one that precludes making the node available again when
ReturnToService is set to 1.  Can someone explain what this means, and why
it should be treated differently from a node that goes down, is declared
"non-responsive", and then comes back up and registers with slurmctld
again?  I can see the reason for treating "low memory" or "prolog failure"
as non-automatic recoveries, but why a plain reboot, especially if the node
then registers with a valid configuration?

A second question is what other conditions result in this behavior; i.e.,
what is hiding behind the "etc." in option 1?

        -Don Albert-