Make sure you don't have
DebugFlags=NO_CONF_HASH
in your slurm.conf.
Then in your slurmctld.log verify you don't see any messages like
error: Node snowflake0 appears to have a different slurm.conf than the
slurmctld. This could cause issues with communication and
functionality. Please review both files and make sure they are the
same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in
your slurm.conf.
I am guessing the slurm.conf file on your nodes may be in sync, but
perhaps the slurmd on the troubled nodes may be running with an old version.
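
If it helps, the two checks above can be scripted. This is only a rough sketch in Python, not anything shipped with Slurm: the function names are made up for illustration, and it assumes you read slurm.conf and slurmctld.log into strings yourself, since their locations differ between sites. Treating the flag name as case-insensitive is also an assumption.

```python
import re

# Matches the slurmctld error quoted above and captures the node name.
MISMATCH_RE = re.compile(
    r"error: Node (\S+) appears to have a different slurm\.conf"
)


def has_no_conf_hash(conf_text):
    """Return True if an uncommented DebugFlags line lists NO_CONF_HASH."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        key, sep, value = line.partition("=")
        if sep and key.strip().lower() == "debugflags":
            # DebugFlags takes a comma-separated list of flags.
            flags = [flag.strip().upper() for flag in value.split(",")]
            if "NO_CONF_HASH" in flags:
                return True
    return False


def nodes_with_conf_mismatch(log_text):
    """Return the sorted, de-duplicated node names from mismatch errors."""
    return sorted(set(MISMATCH_RE.findall(log_text)))
```

Call has_no_conf_hash() with the contents of slurm.conf and nodes_with_conf_mismatch() with the contents of slurmctld.log. An empty list from the second call means no node was reported.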
Danny
On 02/04/2015 10:36 AM, Ulf Markwardt wrote:
Dear Moe and Danny,
I would also check that your configured addresses for the nodes in
slurm.conf are correct (e.g. NodeName and NodeAddr match in slurm.conf).
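
A quick way to eyeball the configured addresses is to list the NodeName and NodeAddr values side by side. The Python sketch below is illustrative only: node_addresses is a hypothetical helper, and it assumes each node definition sits on a single line (no backslash continuations) and does not expand bracketed ranges such as snowflake[1-3].

```python
def node_addresses(conf_text):
    """Map each NodeName value to its NodeAddr value (None if not set)."""
    result = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line.lower().startswith("nodename="):
            continue
        # Collect the key=value pairs on this node definition line.
        fields = {}
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                fields[key.lower()] = value
        result[fields["nodename"]] = fields.get("nodeaddr")
    return result
```

Running it on the slurm.conf text from the controller and from a troubled node, then comparing the two dictionaries, shows whether the address entries agree.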
Quoting Danny Auble <[email protected]>:
Ulf, I would verify the slurm.conf is the same on each node.
After initial confusion with diverging slurm.conf files, we have a
provisioning tool that simply ensures our config files are in sync
(apart from different energy sensors, GPU things, etc.). They are always
updated at the start of the slurm daemon.
I have just checked: there are no differences in hostnames,
partitions, addresses, etc.
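
For anyone wanting to repeat this kind of comparison, here is a minimal Python sketch. It is not how Slurm computes its own configuration hash; it only compares the effective lines of two slurm.conf texts, and because it ignores comments and whitespace it may miss differences that the real check would notice. The function names are hypothetical.

```python
def normalized_lines(conf_text):
    """Return the set of non-empty lines, minus comments and extra spaces."""
    lines = set()
    for raw in conf_text.splitlines():
        # Strip the comment, then collapse runs of whitespace.
        line = " ".join(raw.split("#", 1)[0].split())
        if line:
            lines.add(line)
    return lines


def conf_differences(controller_text, node_text):
    """Return (lines only on the controller, lines only on the node)."""
    controller = normalized_lines(controller_text)
    node = normalized_lines(node_text)
    return sorted(controller - node), sorted(node - controller)
```

Two empty lists mean the files agree line for line after normalization. Intentional per-node differences, such as sensor or GPU settings, will show up in the output as well.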
Best regards,
ulf