|
The issue has been resolved. It turns out slurm01 was first
installed with version 2.2.3, and version 2.2.4 was installed later.
The 2.2.3 install used the /usr/local prefix (i.e.
/usr/local/etc/slurm/slurm.conf). The slurm.conf I uploaded was
/etc/slurm/slurm.conf from the 2.2.4 install. I have now
symlinked /usr/local/etc/slurm to /etc/slurm, and everything works
correctly.
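For anyone hitting the same symptom, here is a minimal sketch for
checking whether two slurm.conf copies disagree about the controller.
It assumes the controller is named on a ControlMachine= line and that
the two paths from this thread are the only candidates; the function
names are made up for illustration and are not part of slurm.

```python
import os


def read_setting(conf_text, key):
    """Return the value of the last `key=value` line, ignoring comments."""
    value = None
    for line in conf_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        name, _, val = line.partition("=")
        # Compare keys case-insensitively.
        if name.strip().lower() == key.lower():
            value = val.strip()
    return value


def compare_configs(paths, key="ControlMachine"):
    """Map each existing path to (resolved real path, value of `key`)."""
    report = {}
    for path in paths:
        if not os.path.exists(path):
            continue  # skip candidates that are not present on this host
        with open(path) as fh:
            report[path] = (os.path.realpath(path), read_setting(fh.read(), key))
    return report


if __name__ == "__main__":
    # Assumed candidate locations, taken from the two installs described above.
    candidates = ["/etc/slurm/slurm.conf", "/usr/local/etc/slurm/slurm.conf"]
    for path, (real, value) in compare_configs(candidates).items():
        print(f"{path} -> {real}: ControlMachine={value}")
```

If both paths resolve to the same real file (as they should after the
symlink), the two lines report the same value; two different values
would point to a stale config from an older install.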
On 06/09/2011 17:44, Moe Jette wrote:
There appears to be something very strange about your
configuration to cause a problem like this.
"slurmhpc" isn't even in slurm01's slurm.conf file. Are
slurmhpc and slurm01 virtual machines running on the same hardware
or something like that? Perhaps you have two slurm installations
on slurm01 with different paths.
Moe Jette
SchedMD LLC
Quoting Sten Wolf <[email protected]>:
Hi all,
I'm having a very strange problem.
A single network has 2 slurm clusters:
slurm01 is responsible for single big SMP node bignode
slurmhpc is responsible for many weaker nodes node[001-200]
both can ping each other, resolve correctly (dns seems correct,
nslookup works on all)
slurmhpc is working correctly (no issues).
slurm01 will not manage bignode.
scontrol ping from bignode returns slurm01 as its primary, but
scontrol ping from slurm01 returns slurmhpc as its primary (munge
keys are different for the 2 clusters).
slurm01 uses accounting (slurmdbd) and it used to manage all
nodes, but they have all been removed from slurm.conf.
slurmhpc doesn't use any accounting, and its slurm.conf doesn't
have slurm01 or bignode.
slurm01 resolves slurm01 and slurmhpc correctly (nslookup, ping)
but for whatever reason keeps trying to connect to slurmhpc as its
primary.
scontrol reconfig doesn't help,
service slurm stop ; service slurm startclean doesn't help.
I have attached the slurm.conf for slurm01.
|