We do the same for our compute nodes, but I prefer to keep the head node
independent of any other servers, to ensure that it can boot up and
function fully by itself. I've spent plenty of time untangling nasty
messes caused by a crashed NFS server holding system files for other nodes.
Cheers,
JB
On 3/11/14 7:17 AM, Andy Riebs wrote:
We keep slurm.conf on a shared NFS partition, which may be adding some
noise to the situation, but it does make it easier to propagate
system-wide changes.
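The daemons read the shared copy directly rather than each node keeping a
local file -- if I recall correctly we just point them at it via the
SLURM_CONF environment variable (e.g. SLURM_CONF=/home/slurm/slurm.conf),
though the path could equally have been compiled in as the default.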
Andy
On 03/10/2014 09:44 PM, Williams, Kevin E. (Federal SIP) wrote:
Hey there. Are you not using /etc/slurm as the dir for the conf
files? Also, if the files differ across the cluster, then the flag to
ignore differences needs to be set in the files. I assume that you
copied out the edited file to all nodes prior to the reconfig. Other
than that, I have not seen any issues when changing the config while
slurm is running and using reconfig to initiate the changes.
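If memory serves, the flag in question is the NO_CONF_HASH debug flag,
i.e. something like the line below in slurm.conf (worth double-checking
against the slurm.conf man page):

    # stop warning when slurm.conf contents differ between daemons
    DebugFlags=NO_CONF_HASH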
-----Original Message-----
From: Riebs, Andy
Sent: Monday, March 10, 2014 3:21 PM
To: slurm-dev
Subject: [slurm-dev] slurmd crashed on *some* nodes after "scontrol
reconfigure"
I had edited slurm.conf to create a couple of new slurm partitions.
In what appears to be a flukey coincidence, the slurmd daemons on 32
contiguous nodes apparently failed while slurmctld was reconfiguring
itself.
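For reference, the additions were just ordinary partition definitions (the
names and node ranges below are made up, but the lines were of this shape),
followed by "scontrol reconfigure" on the head node:

    # illustrative only -- not the actual partition names or node ranges
    PartitionName=short Nodes=node[001-128] MaxTime=01:00:00 State=UP
    PartitionName=long  Nodes=node[001-128] MaxTime=INFINITE State=UP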
The slurmctld log,
[2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2014-03-10T12:36:23.484] restoring original state of nodes
[2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
[2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616
From one of the compute nodes,
[2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
[2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file
Has anyone seen this before? slurm.conf is on an NFS server, so it's
possible we've got a configuration error there.
In any event, I'm wondering if slurmd should retry when it sees this
failure. None of the other nodes were apparently affected.
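Something along these lines is what I have in mind -- purely an
illustrative sketch on my part, not actual Slurm source, and the function
name, retry count, and delay are made up:

    /* Illustrative only: retry the stat() so a brief NFS outage doesn't
     * immediately become fatal for slurmd. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int stat_with_retry(const char *path, struct stat *buf)
    {
        for (int i = 0; i < 5; i++) {
            if (stat(path, buf) == 0)
                return 0;             /* conf file is visible again */
            fprintf(stderr, "stat(%s): %s, retrying\n",
                    path, strerror(errno));
            sleep(2);                 /* give the NFS mount a moment to recover */
        }
        return -1;                    /* still unreachable; caller can go fatal */
    }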
Cheers
Andy
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Jason W. Bacon
[email protected]
Circumstances don't make a man:
They reveal him.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~