A note from the guy who set up DNS:
"the dynamic dns updates about every 5 min, so a dead server will have
its dns entry removed"
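A quick way to confirm that from the controller is a direct query against
the DNS server (the @10.0.0.1 address is taken from the nslookup output
quoted below; just a sketch):

dig +short thenodename @10.0.0.1   # prints nothing once the record is gone
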
Also, looks like nothing is being returned by gethostbyname:
me@slurmctrlr:~$ perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at -e line 1.
me@slurmctrlr:~$ perl -MSocket -e 'print (scalar gethostbyname("thenodename"), "\ntest line\n")'
test line
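
For what it's worth, a variant of that one-liner that reports the missing
address instead of dying inside inet_ntoa (same hostname; just a sketch):

perl -MSocket -e 'my $a = gethostbyname("thenodename"); print defined $a ? inet_ntoa($a) : "no address", "\n"'
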
On Wed, Aug 10, 2011 at 10:02 AM, Mark A. Grondona <[email protected]> wrote:
> On Wed, 10 Aug 2011 09:49:50 -0700, Mike Schachter <[email protected]>
> wrote:
>> me@slurmctrlr:~$ nslookup thenodename
>> Server: 10.0.0.1
>> Address: 10.0.0.1#53
>>
>> ** server can't find thenodename: NXDOMAIN
>
> Sorry, that wasn't quite sufficient, though now we know "thenodename"
> isn't in DNS. Does this perl script print the IP for thenodename:
>
> perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
>
> mark
>
>>
>>
>>
>> On Wed, Aug 10, 2011 at 9:40 AM, Mark A. Grondona <[email protected]> wrote:
>> > On Wed, 10 Aug 2011 09:28:41 -0700, Mike Schachter
>> > <[email protected]> wrote:
>> >> No, once the node "thenodename" goes down, there is no more
>> >> host resolution; it can't be pinged or ssh'ed to.
>> >>
>> >> mike
>> >
>> > I meant does
>> >
>> > nslookup thenodename
>> >
>> > work?
>> >
>> > mark
>> >
>> >
>> >>
>> >> On Wed, Aug 10, 2011 at 9:21 AM, Mark A. Grondona <[email protected]>
>> >> wrote:
>> >> > On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter
>> >> > <[email protected]> wrote:
>> >> >> I'm sure that slurm is built to handle failover like this! That's
>> >> >> why I'm so troubled by the behavior I'm seeing.
>> >> >>
>> >> >> The node configuration is relatively straightforward:
>> >> >>
>> >> >> NodeName=thenodename Procs=4 State=UNKNOWN
>> >> >> #.. other nodes defined the same way
>> >> >>
>> >> >> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100 Shared=NO Default=YES
>> >> >
>> >> > Can you resolve all these hostnames on the controller node?
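>> >> >
>> >> > (If resolution turns out to be the problem, one workaround might be to
>> >> > pin each node's address in slurm.conf with NodeAddr, so the controller
>> >> > never has to look the name up; the 10.0.0.42 here is a placeholder:
>> >> >
>> >> > NodeName=thenodename NodeAddr=10.0.0.42 Procs=4 State=UNKNOWN)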
>> >> >
>> >> >
>> >> >
>> >> >> Could it be the State=UNKNOWN that is screwing things up? Are
>> >> >> there any other configuration options that could produce this
>> >> >> behavior? What seemed to happen was that a reconfigure command
>> >> >> was sent to the controller, right before the error messages I sent
>> >> >> previously:
>> >> >>
>> >> >> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Wed, Aug 10, 2011 at 9:03 AM, <[email protected]> wrote:
>> >> >> > There are many clusters running slurm where nodes go down daily, and
>> >> >> > you are the first person to report a problem. My best guess is that
>> >> >> > your slurm.conf file is bad. What do your node configuration lines
>> >> >> > look like in slurm.conf?
>> >> >> >
>> >> >> > Quoting Mike Schachter <[email protected]>:
>> >> >> >
>> >> >> >> So this morning a node went down while a bunch of jobs were
>> >> >> >> running, the slurm controller tried to reconfigure, and these
>> >> >> >> were the only error messages we got in the log file:
>> >> >> >>
>> >> >> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
>> >> >> >> slurmctld: fatal: slurm_set_addr failure on thenodename
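>> >> >> >>
>> >> >> >> (Since the fatal error is pure name resolution, a stopgap that should
>> >> >> >> let slurmctld start is a static entry for the node on the controller,
>> >> >> >> e.g. in /etc/hosts; the address is a placeholder:
>> >> >> >>
>> >> >> >> 10.0.0.42   thenodename)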
>> >> >> >>
>> >> >> >> The controller won't even restart if I set State=DOWN for the node
>> >> >> >> in /etc/slurm.conf. I have to manually remove the node from the
>> >> >> >> configuration file in order for the controller to restart.
>> >> >> >>
>> >> >> >> This is a huge problem for us! We expected that nodes could go
>> >> >> >> down with graceful failover. Any idea what's going on?
>> >> >> >>
>> >> >> >> mike
>> >> >> >>
>> >> >> >>
>> >> >> >> On Tue, Aug 2, 2011 at 4:53 PM, <[email protected]> wrote:
>> >> >> >>>
>> >> >> >>> Your log should say what is happening. If not, try logging on as root
>> >> >> >>> and starting the daemon by hand with lots of debugging (-v's):
>> >> >> >>> "slurmctld -Dvvvvv"
>> >> >> >>>
>> >> >> >>> Quoting Mike Schachter <[email protected]>:
>> >> >> >>>
>> >> >> >>>> Hi there,
>> >> >> >>>>
>> >> >> >>>> If we have a node down, and then restart the slurm controller,
>> >> >> >>>> for some reason slurm won't start up! Is there some way to
>> >> >> >>>> ameliorate this issue?
>> >> >> >>>>
>> >> >> >>>> mike