We keep slurm.conf on a shared NFS partition, which may be adding some noise to the situation, but it does make it easier to propagate system-wide changes.
Andy

On 03/10/2014 09:44 PM, Williams, Kevin E. (Federal SIP) wrote:

Hey there. Are you not using /etc/slurm as the dir for the conf files? Also, if the files differ across the cluster, then the flag to ignore differences needs to be set in the files. I assume that you copied out the edited file to all nodes prior to the reconfig. Other than that, I have not seen any issues when changing the config while slurm is running and using reconfig to initiate the changes.

-----Original Message-----
From: Riebs, Andy
Sent: Monday, March 10, 2014 3:21 PM
To: slurm-dev
Subject: [slurm-dev] slurmd crashed on *some* nodes after "scontrol reconfigure"

I had edited slurm.conf to create a couple of new slurm partitions. In what appears to be a flukey coincidence, the slurmd daemons on 32 contiguous nodes apparently failed while slurmctld was reconfiguring itself.

From the slurmctld log:

[2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2014-03-10T12:36:23.484] restoring original state of nodes
[2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
[2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616

From one of the compute nodes:

[2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
[2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file

Has anyone seen this before? slurm.conf is on an NFS server, so it's possible we've got a configuration error there. In any event, I'm wondering if slurmd should retry when it sees this failure. None of the other nodes were apparently affected.

Cheers
Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
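On the retry question raised above: the failing call is a stat of the config file path shown in the slurmd log, so one way to ride out a transient NFS hiccup would be to retry the stat a few times before treating it as fatal. Below is a minimal, illustrative sketch of that idea (not Slurm source code; the function name, retry count, and delay are made up for the example), using only the path that appears in the log:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/stat.h>

/* Try to stat `path` up to `attempts` times, sleeping `delay_sec`
 * seconds between tries. Returns 0 on success, -1 (errno set) on
 * failure. Hypothetical helper for illustration only. */
static int stat_with_retry(const char *path, struct stat *st,
                           int attempts, unsigned int delay_sec)
{
    for (int i = 0; i < attempts; i++) {
        if (stat(path, st) == 0)
            return 0;
        fprintf(stderr, "stat(%s) failed (%s), attempt %d of %d\n",
                path, strerror(errno), i + 1, attempts);
        if (i + 1 < attempts)
            sleep(delay_sec);
    }
    return -1;
}

int main(void)
{
    struct stat st;
    /* Path taken from the compute-node log message above. */
    if (stat_with_retry("/home/slurm/slurm.conf", &st, 5, 2) != 0) {
        fprintf(stderr, "giving up on slurm.conf: %s\n", strerror(errno));
        return 1;
    }
    printf("slurm.conf is %lld bytes\n", (long long) st.st_size);
    return 0;
}

Whether a retry like this belongs inside slurmd itself, as suggested in the message, is exactly the open question in this thread; the sketch only shows the shape such a workaround could take.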
