We have a similar setup, with slurm.conf on an NFS disk, and find that running scontrol reconfigure can cause problems, such as setting all the nodes in a partition to the down state. Any jobs running on those nodes then crash with node failure errors.
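
For what it's worth, when it happens here we just resume the affected nodes by hand once slurmd is healthy again. Rough sketch (the node range is only a placeholder for your own hostlist):

    # Show which nodes are down and the reason slurmctld recorded for them.
    sinfo -R
    # Once slurmd on the affected nodes is responding again, bring them back.
    scontrol update NodeName=node[001-032] State=RESUME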
Cheers,

On Mon, 2014-03-10 at 12:32 -0700, Andy Riebs wrote:
> Clarifying: that was 32 nodes out of a much larger number of nodes in
> the cluster.
>
> On 03/10/2014 03:20 PM, Andy Riebs wrote:
> > I had edited slurm.conf to create a couple of new slurm partitions. In
> > what appears to be a flukey coincidence, the slurmd daemons on 32
> > contiguous nodes apparently failed while slurmctld was reconfiguring
> > itself.
> >
> > The slurmctld log,
> > [2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
> > [2014-03-10T12:36:23.484] restoring original state of nodes
> > [2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
> > [2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616
> >
> > From one of the compute nodes,
> > [2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
> > [2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file
> >
> > Has anyone seen this before? slurm.conf is on an NFS server, so it's
> > possible we've got a configuration error there.
> >
> > In any event, I'm wondering if slurmd should retry when it sees this
> > failure. None of the other nodes were apparently affected.
> >
> > Cheers
> > Andy
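
One workaround we have been considering (untested here, and the paths and hostlist below are only placeholders for your own) is to keep the master copy on NFS but push it to local disk on every node before reconfiguring, so slurmd never has to stat the NFS file at reconfigure time:

    # Copy the master slurm.conf to local disk on each compute node,
    # then tell slurmctld/slurmd to re-read their configuration.
    for host in $(scontrol show hostnames node[001-100]); do
        scp /home/slurm/slurm.conf ${host}:/etc/slurm/slurm.conf
    done
    scontrol reconfigure

If you have pdsh/pdcp on the cluster, a parallel copy would be the more usual way to do the distribution step.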
