Clarifying: that was 32 nodes out of a much larger cluster.


On 03/10/2014 03:20 PM, Andy Riebs wrote:

I had edited slurm.conf to create a couple of new Slurm partitions. In what appears to be a fluky coincidence, the slurmd daemons on 32 contiguous nodes failed while slurmctld was reconfiguring itself.

The slurmctld log,
[2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2014-03-10T12:36:23.484] restoring original state of nodes
[2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
[2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616

From one of the compute nodes,
[2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
[2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file

Has anyone seen this before? slurm.conf is on an NFS server, so it's possible we've got a configuration error there.
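
For what it's worth, one way to see whether the NFS mount itself is at fault would be to poll stat() on the file from one of the affected nodes and watch for intermittent failures (an ESTALE or EIO there would point at the mount rather than Slurm). A rough sketch, with the path taken from the log above and everything else made up for illustration:

    #!/usr/bin/env python3
    # Poll stat() on the NFS-mounted slurm.conf and log any failures with
    # their errno, to see whether the mount goes stale or unreachable
    # intermittently. Stop with Ctrl-C.
    import errno
    import os
    import time

    CONF = "/home/slurm/slurm.conf"   # path reported by s_p_parse_file

    while True:
        try:
            st = os.stat(CONF)
            print(time.strftime("%Y-%m-%dT%H:%M:%S"), "ok, size", st.st_size)
        except OSError as e:
            # ESTALE or EIO here would implicate the NFS mount, not slurmd
            print(time.strftime("%Y-%m-%dT%H:%M:%S"),
                  "stat failed:", errno.errorcode.get(e.errno, e.errno), e)
        time.sleep(5)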

In any event, I'm wondering whether slurmd should retry when it sees this failure. Apparently none of the other nodes were affected.
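
What I have in mind is a bounded retry with backoff around the stat/read, so that a transient NFS hiccup doesn't turn into a fatal error. A rough sketch of the idea only; slurmd itself is C, and the names below are made up for the example:

    # Illustration of retry-with-backoff around reading the config file.
    import os
    import time

    def read_conf_with_retry(path, attempts=5, delay=2.0):
        """Try to stat/read the config a few times before giving up,
        so a transient NFS hiccup doesn't become a fatal error."""
        last_err = None
        for i in range(attempts):
            try:
                os.stat(path)                  # the call s_p_parse_file failed on
                with open(path) as f:
                    return f.read()
            except OSError as e:
                last_err = e
                time.sleep(delay * (i + 1))    # simple linear backoff
        raise last_err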

Cheers
Andy
