I had edited slurm.conf to create a couple of new Slurm partitions. In what appears to be a fluky coincidence, the slurmd daemons on 32 contiguous nodes failed while slurmctld was reconfiguring itself.

The slurmctld log,
[2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2014-03-10T12:36:23.484] restoring original state of nodes
[2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
[2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616

From one of the compute nodes,
[2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
[2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file

Has anyone seen this before? slurm.conf is on an NFS server, so it's possible we've got a configuration error there.

In any event, I'm wondering whether slurmd should retry when it sees this failure. None of the other nodes appears to have been affected.
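
To illustrate the kind of retry I have in mind, here's a rough, hypothetical sketch in C -- not the actual slurmd/s_p_parse_file code, and the retry count, delay, and names are made up for illustration:

/*
 * Hypothetical sketch: retry stat() on slurm.conf a few times with a
 * short backoff before treating the failure as fatal, so a transient
 * NFS hiccup during a reconfigure isn't a death sentence.
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define CONF_RETRIES 5          /* assumed retry count */
#define CONF_RETRY_DELAY_SEC 2  /* assumed delay between attempts */

/* Return 0 once the file can be stat'ed, -1 after all retries fail. */
static int wait_for_conf(const char *path)
{
    struct stat st;
    int attempt;

    for (attempt = 0; attempt < CONF_RETRIES; attempt++) {
        if (stat(path, &st) == 0)
            return 0;
        fprintf(stderr, "unable to status %s (%s), retrying in %d s\n",
                path, strerror(errno), CONF_RETRY_DELAY_SEC);
        sleep(CONF_RETRY_DELAY_SEC);
    }
    return -1;
}

int main(void)
{
    if (wait_for_conf("/home/slurm/slurm.conf") != 0) {
        fprintf(stderr, "fatal: conf file still unreadable after retries\n");
        return 1;
    }
    /* ... proceed with normal config parsing here ... */
    return 0;
}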

Cheers
Andy

--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1 404 648 9024
My opinions are not necessarily those of HP
