First time, long time...


We've recently had slurmctld fail twice after editing slurm.conf. Both failures 
occurred after upgrading to v15.08.11 (from v14), so I'm wondering if there's a 
change in behavior, though that could well be a red herring. One edit changed 
a flag in the SelectTypeParameters= parameter, and the other modified a range 
of nodes in a NodeName= section. We've identified two possibly related errors 
in the logs under /var/log and are hoping for feedback from the community. 
Thanks!



First error



There are a few instances of the error below in /var/log/messages.



# grep -m 1 "hostnames" /var/log/messages

Jun 16 13:39:09 hpc-hn1 slurmctld[8502]: fatal: read_slurm_conf: hostnames inconsistency detected



Most of these errors occurred at the same time as one of the crashes. I did 
find an old-ish bug report (https://bugs.schedmd.com/show_bug.cgi?id=805) that 
references this error, with the reply: "If hosts are added or removed from 
slurm.conf the controller must be restarted, if not the code detects the 
inconsistency and the slurmctld aborts".



So, how does SLURM decide whether the hostnames are consistent? Does every 
node defined in slurm.conf have to be actively checking in with slurmd 
running? We do have a few decommissioned nodes listed in the DownNodes= 
section. We haven't had this issue in the past when editing NodeName= 
sections.



Second error

All of our compute nodes have the same slurm.conf file by way of a symlink for 
/etc/slurm to a GPFS path:



# clush -b -w cn[01-312] "ls -ld /etc/slurm | awk '{print \$(NF-2)\" \"\$(NF-1)\" \"\$NF}'" 2>/dev/null

---------------

cn[01-05,07-34,53-60,65-186,188-253,255-268,270-279,281-312] (285)

---------------

/etc/slurm -> /gpfs/gpfs1/slurm/etc/slurm
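
The symlink check only shows that every node points at the same path, not 
that each node sees the same contents. A minimal sketch of how the contents 
could be compared as well, assuming md5sum is installed everywhere and that 
/etc/slurm/slurm.conf is the file slurmd actually reads (conf_sum is just a 
helper name I made up):

```shell
# Hypothetical helper: print only the checksum of a config file.
# The default path is an assumption based on the symlink shown above.
conf_sum() {
    md5sum "${1:-/etc/slurm/slurm.conf}" | awk '{print $1}'
}

# On the controller:
#   conf_sum
# On the compute nodes, grouped by identical output (same clush options as above):
#   clush -b -w cn[01-312] "md5sum /etc/slurm/slurm.conf"
```

If every node reports the same checksum as the controller, the files on disk 
match, and the question becomes what else the comparison takes into account.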



However, we very frequently get an error message about compute nodes having 
different versions of slurm.conf than the controller node.



# grep "different slurm.conf" /var/log/slurm/Slurmctld.log | wc -l

4727



And that's just since 3AM today...



# grep -m 1 "different slurm.conf" /var/log/slurm/Slurmctld.log

[2016-06-17T03:42:04.219] error: Node cn301 appears to have a different 
slurm.conf than the slurmctld.  This could cause issues with communication and 
functionality.  Please review both files and make sure they are the same.  If 
this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
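
To see whether the errors are spread evenly or concentrated on a few nodes, 
something like the following could be used. This is a rough sketch that 
relies on the log format shown above, where the node name is the fourth 
whitespace-separated field; count_conf_mismatch is a name I made up.

```shell
# Hypothetical helper: count "different slurm.conf" errors per node,
# most frequent first. The default log path is the one used above.
count_conf_mismatch() {
    log="${1:-/var/log/slurm/Slurmctld.log}"
    grep "different slurm.conf" "$log" \
        | awk '{print $4}' \
        | sort | uniq -c | sort -rn
}

# Example: show the ten nodes with the most errors.
#   count_conf_mismatch | head
```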



Does anyone have experience with this error? Is it safe to ignore, or to 
suppress via the recommended DebugFlags=NO_CONF_HASH?



Thank you!



--

Ed Swindelles

University of Connecticut
