We do the same for our compute nodes, but I prefer to keep the head node independent of any other servers to ensure that it can boot and function fully by itself. I've spent plenty of time untangling nasty messes caused by a crashed NFS server holding system files for other nodes.

Cheers,

    JB

On 3/11/14 7:17 AM, Andy Riebs wrote:

We keep slurm.conf on a shared NFS partition, which may be adding some noise to the situation, but it does make it easier to propagate system-wide changes.
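
For illustration, two common ways to point the daemons at a shared copy are a symlink from the compiled-in location, or the SLURM_CONF environment variable; treat the paths below as placeholders rather than our exact layout:

    # Illustrative only; paths are placeholders, not necessarily our setup.
    # Option 1: symlink the default location to the NFS copy on each node.
    ln -sf /home/slurm/slurm.conf /etc/slurm/slurm.conf

    # Option 2: set SLURM_CONF so commands and daemons read the shared file.
    export SLURM_CONF=/home/slurm/slurm.conf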

Andy

On 03/10/2014 09:44 PM, Williams, Kevin E. (Federal SIP) wrote:
Hey there. Are you not using /etc/slurm as the dir for the conf files? Also, if the files differ across the cluster, then the flag to ignore differences needs to be set in the files. I assume that you copied out the edited file to all nodes prior to the reconfig. Other than that, I have not seen any issues when changing the config while slurm is running and using reconfig to initiate the changes.
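
For example, something along these lines (a sketch only -- the node range and the pdsh/pdcp tooling are placeholders, use whatever your site uses to fan out files):

    # Hypothetical example: push the edited file to every node's local
    # config directory, then tell the running daemons to re-read it.
    pdcp -w node[001-128] /etc/slurm/slurm.conf /etc/slurm/slurm.conf
    scontrol reconfigure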

-----Original Message-----
From: Riebs, Andy
Sent: Monday, March 10, 2014 3:21 PM
To: slurm-dev
Subject: [slurm-dev] slurmd crashed on *some* nodes after "scontrol reconfigure"


I had edited slurm.conf to create a couple of new slurm partitions. In what appears to be a flukey coincidence, the slurmd daemons on 32 contiguous nodes apparently failed while slurmctld was reconfiguring itself.

The slurmctld log,
[2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2014-03-10T12:36:23.484] restoring original state of nodes
[2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
[2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616

  From one of the compute nodes,
[2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
[2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file

Has anyone seen this before? slurm.conf is on an NFS server, so it's possible we've got a configuration error there.

In any event, I'm wondering if slurmd should retry when it sees this failure. None of the other nodes were apparently affected.
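
Just to sketch the sort of retry I mean (the path, retry limits, and daemon location below are arbitrary), a wrapper along these lines would ride out a brief NFS stall instead of dying:

    #!/bin/sh
    # Sketch only: retry stat() on slurm.conf a few times before giving up,
    # so a momentary NFS hiccup doesn't kill slurmd at startup.
    conf=/home/slurm/slurm.conf
    tries=10
    while [ "$tries" -gt 0 ]; do
        if stat "$conf" > /dev/null 2>&1; then
            exec /usr/sbin/slurmd "$@"    # daemon path may differ per install
        fi
        sleep 3
        tries=$((tries - 1))
    done
    echo "slurm.conf still unreadable after retries" >&2
    exit 1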

Cheers
Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  Jason W. Bacon
  [email protected]

  Circumstances don't make a man:
  They reveal him.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
