Make sure you don't have

DebugFlags=NO_CONF_HASH

in your slurm.conf.
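
If you would rather check this with a script than by eye, here is a minimal sketch in Python. It only assumes the usual slurm.conf layout (key=value lines, "#" starting a comment, keywords treated case-insensitively, DebugFlags holding a comma-separated list). The function name has_no_conf_hash is made up for this example and is not part of Slurm.

```python
def has_no_conf_hash(conf_text: str) -> bool:
    """Return True if any DebugFlags line in the given text lists NO_CONF_HASH."""
    for raw in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip().lower() != "debugflags":
            continue
        # Assumption: DebugFlags is a comma-separated list of flag names.
        flags = [flag.strip().upper() for flag in value.split(",")]
        if "NO_CONF_HASH" in flags:
            return True
    return False
```

To use it, read your own slurm.conf into a string (the location depends on your installation) and pass it in. A True result means the hash check is being suppressed.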

Then in your slurmctld.log verify you don't see any messages like

error: Node snowflake0 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
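
For a long slurmctld.log, a rough way to list the affected nodes is to match on the wording of the message above. This is only a sketch: the regular expression is taken from that one example line, so it may need adjusting if your Slurm version words the error differently, and nodes_with_conf_mismatch is a hypothetical helper name.

```python
import re

# Pattern taken from the example error message; the wording is an assumption
# for other Slurm versions.
MISMATCH = re.compile(r"error: Node (\S+) appears to have a different slurm\.conf")


def nodes_with_conf_mismatch(log_lines):
    """Return the sorted, de-duplicated node names reported with a config mismatch."""
    nodes = set()
    for line in log_lines:
        match = MISMATCH.search(line)
        if match:
            nodes.add(match.group(1))
    return sorted(nodes)
```

Pass it the lines of your slurmctld.log (for example, an open file object). An empty list means no such message was found.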

I am guessing the slurm.conf file on your nodes may be in sync, but the slurmd on the troubled nodes may be running with an old version.

Danny

On 02/04/2015 10:36 AM, Ulf Markwardt wrote:
Dear Moe and Danny,

I would also check that your configured addresses for the nodes in
slurm.conf are correct (e.g. NodeName and NodeAddr match in slurm.conf).

Quoting Danny Auble <[email protected]>:
Ulf, I would verify the slurm.conf is the same in each node.

After initial confusion with diverging slurm.conf files, we now have a provisioning tool that simply ensures our config files are synchronized (apart from different energy sensors, GPU things, etc.). They are always updated when the Slurm daemon starts.

I have just checked again: there are no differences in hostnames, partitions, addresses, etc.

Best regards,
ulf
