What is meant by "silently rebooting" is that the node never became
non-responsive; it just unexpectedly rebooted. At a minimum, that means
any jobs previously running on the node are gone. The obvious question is
whether the node should be marked DOWN in that case. I can tell you that
this has been the behavior since Slurm v2.1 became available, and it seems
to work fine. I'll try to clarify this in the documentation.

The "etc." means any other reason that a node was put into a DOWN state,
which could be any reason a system administrator chose to specify.
________________________________________
From: [email protected] [[email protected]] On Behalf 
Of [email protected] [[email protected]]
Sent: Wednesday, June 01, 2011 2:02 PM
To: [email protected]
Subject: [slurm-dev] Question on slurm.conf "ReturnToService" parameter

The 'man' file for 'slurm.conf' shows the following for the "ReturnToService"
parameter:

   ReturnToService
     Controls when a DOWN node will be returned to service. The
     default value is 0. Supported values include

     0   A node will remain in the DOWN state until a system
         administrator explicitly changes its state (even if the
         slurmd daemon registers and resumes communications).
     1   A DOWN node will become available for use upon registration
         with a valid configuration only if it was set DOWN due to
         being non-responsive. If the node was set DOWN for any
         other reason (low memory, prolog failure, epilog failure,
         silently rebooting, etc.), its state will not automatically
         be changed.
     2   A DOWN node will become available for use upon registration
         with a valid configuration. The node could have been set
         DOWN for any reason.
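
For illustration only, the three values as worded in the excerpt above can
be modeled as a small decision function. This is a sketch of the man page
text, not Slurm's actual code: the DownReason enum, the function name, and
the valid_config flag are all hypothetical names introduced here.

```python
from enum import Enum


class DownReason(Enum):
    """Hypothetical labels for why a node was set DOWN (not Slurm identifiers)."""
    NON_RESPONSIVE = "non-responsive"
    LOW_MEMORY = "low memory"
    PROLOG_FAILURE = "prolog failure"
    EPILOG_FAILURE = "epilog failure"
    SILENT_REBOOT = "silently rebooting"
    ADMIN_REQUEST = "set by administrator"


def returns_to_service(return_to_service: int,
                       reason: DownReason,
                       valid_config: bool) -> bool:
    """Return True if a DOWN node would become available again when it
    registers, following the wording of the man page excerpt."""
    if return_to_service not in (0, 1, 2):
        raise ValueError("ReturnToService must be 0, 1 or 2")
    # Values 1 and 2 both require registration with a valid configuration.
    if not valid_config:
        return False
    if return_to_service == 0:
        # Stays DOWN until an administrator changes the state explicitly.
        return False
    if return_to_service == 1:
        # Only nodes set DOWN for being non-responsive come back on their own.
        return reason is DownReason.NON_RESPONSIVE
    # Value 2: any DOWN reason is acceptable.
    return True
```

Under this reading, a node that rebooted without ever being declared
non-responsive stays DOWN with ReturnToService=1 but comes back with
ReturnToService=2.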

My question concerns the "silently rebooting" reason, which is specifically
mentioned as a reason that precludes making the node available again when
the value of ReturnToService is set to "1". Can someone explain what this
means, and why it should be treated differently from a node that goes down,
is declared "non-responsive", and then comes back up and registers with
slurmctld again? I can see the reason for treating "low memory" or "prolog
failure" as non-automatic recoveries, but why a plain reboot, especially if
the node then registers with a valid configuration?

A second question is what other conditions result in this behavior; i.e.,
what is hiding behind the "etc." in option 1?

        -Don Albert-
