Hi Brad,
I believe the problem is that slurmctld does the equivalent of
`hostname -s`, which returns "bioshock". Since that name does not match
the configured ControlMachine ("master"), slurmctld concludes that it
is not running on a valid controller host.
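To illustrate the kind of check I have in mind, here is a rough Python sketch. It is not slurmctld's actual code; the function names are made up for this example, and the only assumption is that the daemon compares the short hostname against the configured controller names.

```python
import socket


def short_hostname():
    # Rough equivalent of `hostname -s`: the node name up to the first dot.
    return socket.gethostname().split(".")[0]


def is_valid_controller(host, control_machine, backup_controller=None):
    # Hypothetical check: the host must match one of the configured
    # controller names. A missing BackupController never matches.
    if host == control_machine:
        return True
    return backup_controller is not None and host == backup_controller
```

With ControlMachine=master and no BackupController, `is_valid_controller("bioshock", "master")` is false, which is consistent with the error you are seeing.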
The easiest way to resolve the problem would be to use "bioshock" as
the value of SLURM's ControlMachine parameter; remember that all IP
traffic is actually routed by IP address, rather than by network or
host name, so this shouldn't confuse anything.
It may be possible, instead, to set ControlAddr=master, and
ControlMachine=bioshock, but my test bed is currently down, so I can't
check this out.
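For reference, the second option would look something like the slurm.conf fragment below. This is untested, as noted above; "bioshock" and "master" are the names from this thread, and "master" is assumed to resolve to the cluster-private interface.

```
# slurm.conf fragment (untested sketch)
# Name that `hostname -s` returns on the controller node
ControlMachine=bioshock
# Name or address the other nodes should use to reach the controller
ControlAddr=master
```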
Or am I missing some facet of this?
Andy
On 11/15/2011 03:42 PM, Brad Reisfeld wrote:
On 11/15/2011 10:31 AM, Andy Riebs wrote:
Brad, try disabling (commenting out) the BackupController
definition. It's not inconceivable that SLURM is getting confused by
trying to run 2 copies of the daemon on the same node.
Andy
Hi Andy,
I appreciate the suggestion.
I tried that change and get the following:
$ slurmctld -Dvvv
...
slurmctld: error: this host (bioshock) not valid controller (master or (null))
So, it appears that this is disallowed because neither ControlMachine
nor BackupController is set to the machine's hostname. How is this
normally done for a master node that has both a public and a
cluster-private network interface?
Thank you.
Kind regards,
Brad
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP