Clarifying: that was 32 nodes out of a much larger cluster.


On 03/10/2014 03:20 PM, Andy Riebs wrote:

I had edited slurm.conf to create a couple of new Slurm partitions. In what appears to be a fluky coincidence, the slurmd daemons on 32 contiguous nodes failed while slurmctld was reconfiguring itself.

The slurmctld log,
[2014-03-10T12:36:23.428] Processing RPC: REQUEST_RECONFIGURE from uid=0
[2014-03-10T12:36:23.484] restoring original state of nodes
[2014-03-10T12:36:23.502] read_slurm_conf: backup_controller not specified.
[2014-03-10T12:36:23.779] _slurm_rpc_reconfigure_controller: completed usec=351616

From one of the compute nodes,
[2014-03-10T12:36:23.666] s_p_parse_file: unable to status file "/home/slurm/slurm.conf"
[2014-03-10T12:36:23.666] fatal: something wrong with opening/reading conf file

Has anyone seen this before? slurm.conf is on an NFS server, so it's possible we've got a configuration error there.
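
For what it's worth, one way to see whether the NFS mount itself is at fault would be to poll stat() on the file from one of the affected nodes and watch for intermittent failures (an ESTALE or EIO there would point at the mount rather than Slurm). A rough sketch, with the path taken from the log above and everything else made up for illustration:

    #!/usr/bin/env python3
    # Poll stat() on the NFS-mounted slurm.conf and log any failures with
    # their errno, to see whether the mount goes stale or unreachable
    # intermittently. Stop with Ctrl-C.
    import errno
    import os
    import time

    CONF = "/home/slurm/slurm.conf"   # path reported by s_p_parse_file

    while True:
        try:
            st = os.stat(CONF)
            print(time.strftime("%Y-%m-%dT%H:%M:%S"), "ok, size", st.st_size)
        except OSError as e:
            # ESTALE or EIO here would implicate the NFS mount, not slurmd
            print(time.strftime("%Y-%m-%dT%H:%M:%S"),
                  "stat failed:", errno.errorcode.get(e.errno, e.errno), e)
        time.sleep(5)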

In any event, I'm wondering whether slurmd should retry when it sees this failure. Apparently none of the other nodes were affected.
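
What I have in mind is a bounded retry with backoff around the stat/read, so that a transient NFS hiccup doesn't turn into a fatal error. A rough sketch of the idea only; slurmd itself is C, and the names below are made up for the example:

    # Illustration of retry-with-backoff around reading the config file.
    import os
    import time

    def read_conf_with_retry(path, attempts=5, delay=2.0):
        """Try to stat/read the config a few times before giving up,
        so a transient NFS hiccup doesn't become a fatal error."""
        last_err = None
        for i in range(attempts):
            try:
                os.stat(path)                  # the call s_p_parse_file failed on
                with open(path) as f:
                    return f.read()
            except OSError as e:
                last_err = e
                time.sleep(delay * (i + 1))    # simple linear backoff
        raise last_err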

Cheers
Andy
