First off, the assumption is that you have the same slurm.conf file on 
your management node and all the compute nodes.

The node definitions listed in your slurm.conf file become the gold standard.  
For example:

NodeName=rzmerl[1-152] NodeAddr=erzmerl[1-152] Sockets=2 CoresPerSocket=8 
RealMemory=30000 State=UNKNOWN
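
To make the bracket notation concrete, here is a minimal Python sketch that expands a simple range such as rzmerl[1-152] into individual node names. The helper name expand_simple_range is hypothetical, and it only handles a single "prefix[lo-hi]" range; real Slurm host lists also allow comma-separated pieces and zero padding, which this sketch ignores.

```python
import re

def expand_simple_range(expr):
    """Expand 'prefix[lo-hi]' into a list of names (simple case only)."""
    m = re.fullmatch(r"([^\[\]]+)\[(\d+)-(\d+)\]", expr)
    if m is None:
        # Not a simple range: return the expression unchanged.
        return [expr]
    prefix, lo, hi = m.group(1), int(m.group(2)), int(m.group(3))
    return [f"{prefix}{i}" for i in range(lo, hi + 1)]

nodes = expand_simple_range("rzmerl[1-152]")
print(len(nodes), nodes[0], nodes[-1])
```

On a live system, "scontrol show hostnames" performs the real expansion.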

If you attempt to start a slurmd on a node that is not listed in your 
slurm.conf file, it will not start.

When slurmd starts up, it reports the resources it finds to the slurmd.log. 
For example:

[2014-04-10T08:37:59] slurmd started on Thu 10 Apr 2014 08:37:59 -0700
[2014-04-10T08:37:59] Procs=16 Sockets=2 Cores=8 Threads=1 Memory=31929 
TmpDisk=15964 Uptime=13302020
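
If you want to pull those values out of the log for comparison, a small Python sketch along these lines should work. The function name parse_slurmd_resources is hypothetical, and it assumes the resource line is a single unwrapped line of Key=integer pairs as shown above.

```python
import re

def parse_slurmd_resources(line):
    """Return the Key=integer pairs found on a slurmd resource log line."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", line)}

log_line = ("[2014-04-10T08:37:59] Procs=16 Sockets=2 Cores=8 Threads=1 "
            "Memory=31929 TmpDisk=15964 Uptime=13302020")
found = parse_slurmd_resources(log_line)
print(found["Procs"], found["Memory"])
```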

If the Procs and Memory it detects on the node do not match the NodeName info 
listed in the slurm.conf file, and ReturnToService is 2, the node will stay 
down - but only if the slurm.conf’s FastSchedule parameter is set to its 
default value of 1.
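
As a rough illustration of that check (not the actual slurmctld code), the Python sketch below compares the configured values with what slurmd detected. It assumes the registration counts as valid when no detected value is lower than the configured one, which is how I read the FastSchedule=1 description in "man slurm.conf"; the function and dictionary key names are hypothetical.

```python
def registers_valid(configured, detected):
    """True if every detected value is at least the configured value.

    Assumption: only a shortfall makes the registration invalid.
    """
    return all(detected.get(key, 0) >= value
               for key, value in configured.items())

# Configured: Sockets=2 * CoresPerSocket=8, RealMemory=30000 (from slurm.conf)
configured = {"CPUs": 2 * 8, "RealMemory": 30000}
# Detected: Procs=16, Memory=31929 (from slurmd.log)
detected = {"CPUs": 16, "RealMemory": 31929}

ok = registers_valid(configured, detected)
short = registers_valid(configured, {"CPUs": 16, "RealMemory": 29000})
print(ok, short)
```

With the numbers from the examples above, the node has at least the configured CPUs and memory, so under this assumption it would register as valid; the second call shows a node that comes up short on memory.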

Don


From: Hill, Marti T [mailto:[email protected]]
Sent: Thursday, April 10, 2014 7:09 AM
To: slurm-dev
Subject: [slurm-dev] ReturnToService


Question about ReturnToService – how exactly does slurmd decide that a node 
“registers with a valid node configuration”?  What algorithm is used to decide 
this?


Thanks,
Marti

3. Why is a node shown in state DOWN when the node has registered for service?
The configuration parameter ReturnToService in slurm.conf controls how DOWN 
nodes are handled. Set its value to one in order for DOWN nodes to 
automatically be returned to service once the slurmd daemon registers with a 
valid node configuration. A value of zero is the default and results in a node 
staying DOWN until an administrator explicitly returns it to service using the 
command "scontrol update NodeName=whatever State=RESUME". See "man slurm.conf" 
and "man scontrol" for more details.
