There appears to be something very strange about your configuration to
cause a problem like this.
"slurmhpc" isn't even in slurm01's slurm.conf file. Are slurmhpc
and slurm01 virtual machines running on the same hardware, or something
like that? Perhaps you have two Slurm installations on slurm01 with
different paths.
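
One way to test the "two installations" idea is to compare the controller
named in each slurm.conf copy found on slurm01. The following is only a
rough sketch, not an official Slurm tool: the helper names
(parse_slurm_conf, controllers_by_path) are made up for illustration, the
file paths must be supplied by you, and it assumes simple key=value lines
(ControlMachine in older releases, SlurmctldHost in newer ones) without
handling Include directives.

```python
import sys

# Controller-related keys. ControlMachine is the older name, SlurmctldHost the newer one.
KEYS = ("ClusterName", "ControlMachine", "SlurmctldHost", "BackupController")


def parse_slurm_conf(text):
    """Return {key: value} for the controller-related keys found in slurm.conf text."""
    found = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        for token in line.split():
            if "=" not in token:
                continue
            key, value = token.split("=", 1)
            for known in KEYS:
                # slurm.conf keywords are case-insensitive
                if key.lower() == known.lower():
                    found[known] = value
    return found


def controllers_by_path(confs):
    """Map each path in {path: text} to the primary controller it names, if any."""
    result = {}
    for path, text in confs.items():
        parsed = parse_slurm_conf(text)
        controller = parsed.get("SlurmctldHost") or parsed.get("ControlMachine")
        if controller:
            result[path] = controller
    return result


if __name__ == "__main__":
    # Pass every slurm.conf copy you can find on the host as an argument.
    confs = {}
    for path in sys.argv[1:]:
        with open(path) as handle:
            confs[path] = handle.read()
    for path, controller in controllers_by_path(confs).items():
        print(f"{path}: primary controller = {controller}")
```

If two copies on slurm01 name different controllers, the commands are
probably reading a different file than the daemon is.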
Moe Jette
SchedMD LLC
Quoting Sten Wolf <[email protected]>:
Hi all,
having a very strange problem:
a single network has 2 slurm clusters:
slurm01 is responsible for single big SMP node bignode
slurmhpc is responsible for many weaker nodes node[001-200]
both can ping each other, resolve correctly (dns seems correct,
nslookup works on all)
slurmhpc is working correctly (no issues).
slurm01 will not manage bignode.
scontrol ping from bignode returns slurm01 as its primary, but
scontrol ping from slurm01 returns slurmhpc as its primary (munge keys
are different for the 2 clusters).
slurm01 uses accounting (slurmdbd) and it used to manage all nodes,
but the node[001-200] entries have all been removed from slurm.conf
slurmhpc doesn't use any accounting, and its slurm.conf doesn't have
slurm01 or bignode.
slurm01 resolves slurm01 and slurmhpc correctly (nslookup, ping) but
for whatever reason keeps trying to connect to slurmhpc as its
primary.
scontrol reconfig doesn't help,
service slurm stop ; service slurm startclean doesn't help.
attached the slurm.conf for slurm01