We do the same for our compute nodes, but I prefer to keep the head node independent of any other servers to ensure that it can boot and function fully by itself. I've spent plenty of time untangling nasty messes caused by a crashed NFS server holding system files for other nodes.

Cheers,

    JB

On 3/11/14 7:17 AM, Andy Riebs wrote:

We keep slurm.conf on a shared NFS partition, which may be adding some noise to the situation, but it does make it easier to propagate system-wide changes.
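
For illustration, two common ways to point the daemons at a shared copy are a symlink from the compiled-in location, or the SLURM_CONF environment variable; treat the paths below as placeholders rather than our exact layout:

    # Illustrative only; paths are placeholders, not necessarily our setup.
    # Option 1: symlink the default location to the NFS copy on each node.
    ln -sf /home/slurm/slurm.conf /etc/slurm/slurm.conf

    # Option 2: set SLURM_CONF so commands and daemons read the shared file.
    export SLURM_CONF=/home/slurm/slurm.conf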

Andy

On 03/10/2014 09:44 PM, Williams, Kevin E. (Federal SIP) wrote:
Hey there. Are you not using /etc/slurm as the dir for the conf files? Also, if the files differ across the cluster, then the flag to ignore differences needs to be set in the files. I assume that you copied out the edited file to all nodes prior to the reconfig. Other than that, I have not seen any issues when changing the config while slurm is running and using reconfig to initiate the changes.
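
For example, something along these lines (a sketch only -- the node range and the pdsh/pdcp tooling are placeholders, use whatever your site uses to fan out files):

    # Hypothetical example: push the edited file to every node's local
    # config directory, then tell the running daemons to re-read it.
    pdcp -w node[001-128] /etc/slurm/slurm.conf /etc/slurm/slurm.conf
    scontrol reconfigure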

-----Original Message-----
From: Riebs, Andy
Sent: Monday, March 10, 2014 3:21 PM
To: slurm-dev
Subject: [slurm-dev] slurmd crashed on *some* nodes after "scontrol reconfigure"


I had edited slurm.conf to create a couple of new slurm partitions. In what appears to be a flukey coincidence, the slurmd daemons on 32 contiguous nodes apparently failed while slurmctld was reconfiguring itself.

The slurmctld log,
[2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2014-03-10T12:36:23.484] restoring original state of nodes
[2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
[2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616

  From one of the compute nodes,
[2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
[2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file

Has anyone seen this before? slurm.conf is on an NFS server, so it's possible we've got a configuration error there.

In any event, I'm wondering if slurmd should retry when it sees this failure. None of the other nodes were apparently affected.
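
Just to sketch the sort of retry I mean (the path, retry limits, and daemon location below are arbitrary), a wrapper along these lines would ride out a brief NFS stall instead of dying:

    #!/bin/sh
    # Sketch only: retry stat() on slurm.conf a few times before giving up,
    # so a momentary NFS hiccup doesn't kill slurmd at startup.
    conf=/home/slurm/slurm.conf
    tries=10
    while [ "$tries" -gt 0 ]; do
        if stat "$conf" > /dev/null 2>&1; then
            exec /usr/sbin/slurmd "$@"    # daemon path may differ per install
        fi
        sleep 3
        tries=$((tries - 1))
    done
    echo "slurm.conf still unreadable after retries" >&2
    exit 1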

Cheers
Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Jason W. Bacon
  [email protected]

  Circumstances don't make a man:
  They reveal him.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
