Hmm, I don't think this is the case: throughout the code they use
gethostname (not gethostbyname) to get the name of the particular host.
On 31/03/16 16:06, Bill Broadley wrote:
I think I found the problem and solution.
The Slurm 15.08 configuration tool [1] mentions:
Define the hostname of the computer on which the Slurm controller and
optional backup controller will execute. You can also specify addresses of
these computers if desired (defaults to their hostnames). The IP addresses
can be either numeric IP addresses or names. Hostname values should
not be the fully qualified domain name (e.g. use tux rather than
tux.abc.com).
ControlMachine: Master Controller Hostname
ControlAddr: Master Controller Address (optional)
So I normally fill out ControlMachine = hostname, and ControlAddr = IP address.
Turns out when configured this way, slurmctld ONLY listens on the IP address
and NOT on 127.0.0.1. So telnet 127.0.0.1 6817 fails. As you might imagine,
slurmdbd fails as well:
[2016-03-30T19:31:46.971] debug: sending updates to MyClust at 127.0.0.1(6817)
ver 7424
[2016-03-30T19:31:46.971] debug2: Error connecting slurm stream socket at
127.0.0.1:6817: Connection refused
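The telnet check above can be repeated against several addresses at once with a small Python sketch like the following. The function names and the example address 192.0.2.10 are made up for illustration; only port 6817 comes from the log above.

```python
import socket


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable hosts and timeouts.
        return False


def check_listeners(hosts, port=6817, timeout=2.0):
    """Map each host to whether something accepts connections on port."""
    return {host: can_connect(host, port, timeout) for host in hosts}


# Example (192.0.2.10 is a placeholder for the controller's real IP):
#   check_listeners(["127.0.0.1", "192.0.2.10"])
```

If slurmctld is bound only to the ControlAddr IP, the loopback entry should come back False while the controller's own address comes back True.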
The strange thing is I see no documented way for slurmdbd.conf to know the IP
address of the slurm controller.
However if you look at the updated documentation at [2]:
ControlAddr
Name that ControlMachine should be referred to in establishing a
communications path. This name will be used as an argument to the
gethostbyname() function for identification. For example, "elx0000" might be
used to designate the Ethernet address for node "lx0000". By default the
ControlAddr will be identical in value to ControlMachine.
Calling gethostbyname on an IP address isn't what slurm expects, hence the
breakage. Seems weird to call gethostbyname for a variable called ControlAddr.
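To see what the system resolver actually returns for a given ControlAddr value, something like the sketch below may help. It uses Python's socket.gethostbyname wrapper, not Slurm's own code, so it only shows resolver behaviour on the host where it runs; the helper names are hypothetical.

```python
import ipaddress
import socket


def is_numeric_ipv4(value):
    """Return True if value is a dotted-quad IPv4 address, not a name."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def resolve_ipv4(name):
    """Return the IPv4 address the resolver gives for name, or None."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        # Resolution failures surface as subclasses of OSError.
        return None


# Example, using the values from this thread:
#   is_numeric_ipv4("SlurmHead"), resolve_ipv4("SlurmHead")
```

Python's wrapper returns a numeric IPv4 string unchanged, so comparing the result for the hostname with the result for the IP shows whether both settings point at the same interface.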
So the fix is really easy:
ControlMachine=SlurmHead
ControlAddr=SlurmHead
Or I imagine just leaving ControlAddr blank.
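A rough way to spot the problematic setting in an existing slurm.conf is sketched below. It assumes simple Key=Value lines with # comments and keeps only the last value seen for each key, which is enough for ControlMachine and ControlAddr but is not a full slurm.conf parser. The function names are hypothetical.

```python
import ipaddress


def read_settings(conf_text):
    """Collect Key=Value pairs, with keys lower-cased; last value wins."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        for token in line.split():
            key, sep, value = token.partition("=")
            if sep:
                settings[key.lower()] = value
    return settings


def control_addr_is_numeric(conf_text):
    """Return True if ControlAddr is set to a numeric IP address."""
    addr = read_settings(conf_text).get("controladdr")
    if not addr:
        return False  # unset or blank: falls back to ControlMachine
    try:
        ipaddress.ip_address(addr)
        return True
    except ValueError:
        return False
```

A True result marks the configuration described above, where ControlAddr holds an IP address rather than the hostname.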
BTW, this seems new. We have a Slurm 14.11.7 install and it happily allows
telnet localhost 6817 to work, even if ControlAddr is set to the IP address.
Gene can you try my fix? Seems like ControlAddr should either:
A) accept an IP address
B) Be deleted since it's identical to ControlMachine
Out of curiosity, how do you tell slurmdbd which IP to use to connect to the
slurmctld? I tried sneaking a ControlMachine= or ControlAddr= line into
slurmdbd.conf, which causes an error and failure. SlurmDBD always seems to
use 127.0.0.1.
[1] http://slurm.schedmd.com/configurator.easy.html
[2] http://slurm.schedmd.com/slurm.conf.html
--
New Zealand eScience Infrastructure
Centre for eResearch
The University of Auckland
e: [email protected]
p: +64 9 3737599 ext 89834 c: +64 21 840 825 f: +64 9 373 7453
w: www.nesi.org.nz