First off, the assumption is that you have the same slurm.conf file on your management node and all the compute nodes.
The node definitions listed in your slurm.conf file become the gold standard. For example:

NodeName=rzmerl[1-152] NodeAddr=erzmerl[1-152] Sockets=2 CoresPerSocket=8 RealMemory=30000 State=UNKNOWN

If you attempt to start a slurmd on a node that is not listed in your slurm.conf file, it will not start.

When the slurmd starts up, it reports the resources it finds to the slurmd.log. For example:

[2014-04-10T08:37:59] slurmd started on Thu 10 Apr 2014 08:37:59 -0700
[2014-04-10T08:37:59] Procs=16 Sockets=2 Cores=8 Threads=1 Memory=31929 TmpDisk=15964 Uptime=13302020

If the Procs and Memory it detects on the node do not match the NodeName info listed in the slurm.conf file, and ReturnToService is 2, the node will stay down - but only if the slurm.conf’s FastSchedule parameter is set to its default value of 1.

Don

From: Hill, Marti T [mailto:[email protected]]
Sent: Thursday, April 10, 2014 7:09 AM
To: slurm-dev
Subject: [slurm-dev] ReturnToService

Question about ReturnToService – how exactly does slurmd decide that a node “registers with a valid node configuration”? What algorithm is used to decide this?

Thanks,
Marti

3. Why is a node shown in state DOWN when the node has registered for service?

The configuration parameter ReturnToService in slurm.conf controls how DOWN nodes are handled. Set its value to one in order for DOWN nodes to automatically be returned to service once the slurmd daemon registers with a valid node configuration. A value of zero is the default and results in a node staying DOWN until an administrator explicitly returns it to service using the command "scontrol update NodeName=whatever State=RESUME". See "man slurm.conf" and "man scontrol" for more details.
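
To make the comparison described in the reply concrete, here is a minimal Python sketch that checks the resources from the slurmd.log line against the NodeName definition from slurm.conf. It is only an illustration, not Slurm's actual code. The function names (parse_kv, shortfalls) are made up for this example, and two things are assumptions rather than statements from the thread: that the configured processor count is Sockets * CoresPerSocket * ThreadsPerCore when no CPUs= value is given, and that only a shortfall (detected less than configured) counts as a mismatch.

```python
def parse_kv(line):
    """Split a line of whitespace-separated Key=Value tokens into a dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def shortfalls(conf_line, log_line):
    """Return {resource: (configured, detected)} where detected < configured."""
    conf = parse_kv(conf_line)
    seen = parse_kv(log_line)

    # Assumption: with no explicit CPUs= value, the configured processor count
    # is Sockets * CoresPerSocket * ThreadsPerCore (each defaulting to 1).
    cpus = int(conf.get("CPUs", 0)) or (
        int(conf.get("Sockets", 1))
        * int(conf.get("CoresPerSocket", 1))
        * int(conf.get("ThreadsPerCore", 1))
    )
    expected = {"Procs": cpus, "Memory": int(conf.get("RealMemory", 1))}

    # Assumption: only a shortfall is treated as a mismatch.
    return {
        name: (want, int(seen[name]))
        for name, want in expected.items()
        if int(seen[name]) < want
    }


# The two example lines quoted in the message above.
conf_line = ("NodeName=rzmerl[1-152] NodeAddr=erzmerl[1-152] Sockets=2 "
             "CoresPerSocket=8 RealMemory=30000 State=UNKNOWN")
log_line = ("[2014-04-10T08:37:59] Procs=16 Sockets=2 Cores=8 Threads=1 "
            "Memory=31929 TmpDisk=15964 Uptime=13302020")

print(shortfalls(conf_line, log_line))
```

With the example lines from the message, the configured values are 16 processors and 30000 MB, and the node reports 16 and 31929, so under these assumptions nothing falls short and the result is empty.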
