We have a similar setup, with slurm.conf on an NFS disk, and find that running scontrol reconfigure can cause problems, such as setting all the nodes in a partition to the down state. Any jobs running on those nodes then crash with node failure errors.
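
For what it's worth, when it happens here we just resume the affected nodes by hand once slurmd is healthy again. Rough sketch (the node range is only a placeholder for your own hostlist):

    # Show which nodes are down and the reason slurmctld recorded for them.
    sinfo -R
    # Once slurmd on the affected nodes is responding again, bring them back.
    scontrol update NodeName=node[001-032] State=RESUME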
Cheers,

On Mon, 2014-03-10 at 12:32 -0700, Andy Riebs wrote:
> Clarifying: that was 32 nodes out of a much larger number of nodes in
> the cluster.
>
> On 03/10/2014 03:20 PM, Andy Riebs wrote:
> > I had edited slurm.conf to create a couple of new slurm partitions. In
> > what appears to be a flukey coincidence, the slurmd daemons on 32
> > contiguous nodes apparently failed while slurmctld was reconfiguring
> > itself.
> >
> > The slurmctld log,
> > [2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
> > [2014-03-10T12:36:23.484] restoring original state of nodes
> > [2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
> > [2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616
> >
> > From one of the compute nodes,
> > [2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
> > [2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file
> >
> > Has anyone seen this before? slurm.conf is on an NFS server, so it's
> > possible we've got a configuration error there.
> >
> > In any event, I'm wondering if slurmd should retry when it sees this
> > failure. None of the other nodes were apparently affected.
> >
> > Cheers
> > Andy
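
One workaround we have been considering (untested here, and the paths and hostlist below are only placeholders for your own) is to keep the master copy on NFS but push it to local disk on every node before reconfiguring, so slurmd never has to stat the NFS file at reconfigure time:

    # Copy the master slurm.conf to local disk on each compute node,
    # then tell slurmctld/slurmd to re-read their configuration.
    for host in $(scontrol show hostnames node[001-100]); do
        scp /home/slurm/slurm.conf ${host}:/etc/slurm/slurm.conf
    done
    scontrol reconfigure

If you have pdsh/pdcp on the cluster, a parallel copy would be the more usual way to do the distribution step.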
