We keep slurm.conf on a shared NFS partition, which may be adding some noise to the situation, but it does make it easier to propagate system-wide changes.
Andy

On 03/10/2014 09:44 PM, Williams, Kevin E. (Federal SIP) wrote:

Hey there. Are you not using /etc/slurm as the dir for the conf files? Also, if the files differ across the cluster, then the flag to ignore differences needs to be set in the files. I assume that you copied out the edited file to all nodes prior to the reconfig. Other than that, I have not seen any issues when changing the config while slurm is running and using reconfig to initiate the changes.

-----Original Message-----
From: Riebs, Andy
Sent: Monday, March 10, 2014 3:21 PM
To: slurm-dev
Subject: [slurm-dev] slurmd crashed on *some* nodes after "scontrol reconfigure"

I had edited slurm.conf to create a couple of new slurm partitions. In what appears to be a flukey coincidence, the slurmd daemons on 32 contiguous nodes apparently failed while slurmctld was reconfiguring itself.

From the slurmctld log:

[2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2014-03-10T12:36:23.484] restoring original state of nodes
[2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
[2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616

From one of the compute nodes:

[2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
[2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file

Has anyone seen this before? slurm.conf is on an NFS server, so it's possible we've got a configuration error there. In any event, I'm wondering if slurmd should retry when it sees this failure. None of the other nodes were apparently affected.

Cheers
Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
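On the retry question raised above: the failing call is a stat of the config file path shown in the slurmd log, so one way to ride out a transient NFS hiccup would be to retry the stat a few times before treating it as fatal. Below is a minimal, illustrative sketch of that idea (not Slurm source code; the function name, retry count, and delay are made up for the example), using only the path that appears in the log:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/stat.h>

/* Try to stat `path` up to `attempts` times, sleeping `delay_sec`
 * seconds between tries. Returns 0 on success, -1 (errno set) on
 * failure. Hypothetical helper for illustration only. */
static int stat_with_retry(const char *path, struct stat *st,
                           int attempts, unsigned int delay_sec)
{
    for (int i = 0; i < attempts; i++) {
        if (stat(path, st) == 0)
            return 0;
        fprintf(stderr, "stat(%s) failed (%s), attempt %d of %d\n",
                path, strerror(errno), i + 1, attempts);
        if (i + 1 < attempts)
            sleep(delay_sec);
    }
    return -1;
}

int main(void)
{
    struct stat st;
    /* Path taken from the compute-node log message above. */
    if (stat_with_retry("/home/slurm/slurm.conf", &st, 5, 2) != 0) {
        fprintf(stderr, "giving up on slurm.conf: %s\n", strerror(errno));
        return 1;
    }
    printf("slurm.conf is %lld bytes\n", (long long) st.st_size);
    return 0;
}

Whether a retry like this belongs inside slurmd itself, as suggested in the message, is exactly the open question in this thread; the sketch only shows the shape such a workaround could take.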
