The man page for 'slurm.conf' shows the following for the
"ReturnToService" parameter:

ReturnToService
    Controls when a DOWN node will be returned to service. The
    default value is 0. Supported values include

    0   A node will remain in the DOWN state until a system
        administrator explicitly changes its state (even if the
        slurmd daemon registers and resumes communications).

    1   A DOWN node will become available for use upon registration
        with a valid configuration only if it was set DOWN due to
        being non-responsive. If the node was set DOWN for any
        other reason (low memory, prolog failure, epilog failure,
        silently rebooting, etc.), its state will not automatically
        be changed.

    2   A DOWN node will become available for use upon registration
        with a valid configuration. The node could have been set
        DOWN for any reason.

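
To make sure I am reading the text correctly, here is how I understand the three values, written as a small Python sketch. This is only my interpretation of the man page wording, not Slurm's actual logic; the function name, its arguments, and the reason strings are all hypothetical.

```python
def returns_to_service(return_to_service: int, down_reason: str,
                       valid_config: bool) -> bool:
    """Sketch of the man page text: does a DOWN node come back automatically
    when slurmd registers? Names and reason strings are hypothetical."""
    if return_to_service not in (0, 1, 2):
        raise ValueError("ReturnToService is documented with values 0, 1, 2")
    # Values 1 and 2 both require registration with a valid configuration.
    if not valid_config:
        return False
    if return_to_service == 0:
        # Stays DOWN until an administrator changes the state.
        return False
    if return_to_service == 1:
        # Only nodes set DOWN for being non-responsive come back.
        return down_reason == "non-responsive"
    # Value 2: any DOWN reason is acceptable.
    return True
```

Under this reading, a node marked DOWN for "silently rebooting" stays DOWN with ReturnToService=1 even though it registers with a valid configuration, which is the case I am asking about.
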
My question concerns the "silently rebooting" reason, which is specifically
mentioned as a reason that precludes making the node available again
when ReturnToService is set to "1". Can someone explain
what this means, and why it should be treated differently from a node
that goes down, is declared "non-responsive", and then comes back up and
registers with slurmctld again? I can see the reason for treating "low
memory" or "prolog failure" as non-automatic recoveries, but why a plain
reboot, especially if the node then registers with a valid configuration?
A second question is what other conditions result in this behavior; i.e.,
what is hiding behind the "etc." in option 1?
-Don Albert-