Re: [slurm-dev] slurm won't start if node is down?

Mark A. Grondona Wed, 10 Aug 2011 10:34:55 -0700

On Wed, 10 Aug 2011 10:19:56 -0700, Mike Schachter <[email protected]> 
wrote:
> A note from the guy who set up DNS:
> 
> "the dynamic dns updates about every 5 min, so a dead server will have
> its dns entry removed"
> 
> Also, looks like nothing is being returned by gethostbyname:
> 
> me@slurmctrlr:~$ perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
> Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at -e line 1.
> 
> me@slurmctrlr:~$ perl -MSocket -e 'print (scalar gethostbyname("thenodename"), "\ntest line\n")'
> 
> test line


Node hostnames need to be resolvable whether the nodes are up or down.
I'm not sure dynamic DNS is going to work very well with SLURM either.
I would put all your hosts into /etc/hosts and not let their IP
addresses change.
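
A quick way to confirm this on the controller is to try resolving every
node name listed in slurm.conf before restarting slurmctld. The Python
below is only a rough sketch of that check, not part of SLURM: the
function name unresolvable_hosts and its resolve parameter are made up
for illustration, and the node names in the usage note are the ones
quoted in this thread.

```python
import socket


def unresolvable_hosts(hostnames, resolve=socket.gethostbyname):
    """Return the hostnames that `resolve` cannot turn into an address.

    `resolve` defaults to socket.gethostbyname, which uses the system
    resolver (so /etc/hosts and DNS are both consulted, in the order the
    system is configured to use). It is a parameter only so the check
    can be exercised without touching the network.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except OSError:
            # socket.gaierror and socket.herror are subclasses of OSError.
            failed.append(name)
    return failed
```

For example, unresolvable_hosts(["nodename1", "nodename2", "thenodename"])
run on the controller should return an empty list once every node has a
fixed entry in /etc/hosts. Any name it does return is one that slurmctld
would also be unable to resolve.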

mark

 
> 
> 
> 
> On Wed, Aug 10, 2011 at 10:02 AM, Mark A. Grondona <[email protected]> wrote:
> > On Wed, 10 Aug 2011 09:49:50 -0700, Mike Schachter 
> > <[email protected]> wrote:
> >> me@slurmctrlr:~$ nslookup thenodename
> >> Server:               10.0.0.1
> >> Address:      10.0.0.1#53
> >>
> >> ** server can't find thenodename: NXDOMAIN
> >
> > Sorry, that wasn't quite sufficient, except now we know "thenodename"
> > isn't in DNS. Does this perl script print the IP for thenodename:
> >
> >  perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
> >
> > mark
> >
> >>
> >>
> >>
> >> On Wed, Aug 10, 2011 at 9:40 AM, Mark A. Grondona <[email protected]> 
> >> wrote:
> >> > On Wed, 10 Aug 2011 09:28:41 -0700, Mike Schachter 
> >> > <[email protected]> wrote:
> >> >> No, once the node "thenodename" goes down, there is no more
> >> >> host resolution, it can't be pinged or ssh'ed to.
> >> >>
> >> >>   mike
> >> >
> >> > I meant does
> >> >
> >> >  nslookup thenodename
> >> >
> >> > work?
> >> >
> >> > mark
> >> >
> >> >
> >> >>
> >> >> On Wed, Aug 10, 2011 at 9:21 AM, Mark A. Grondona <[email protected]> 
> >> >> wrote:
> >> >> > On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter 
> >> >> > <[email protected]> wrote:
> >> >> >> I'm sure that slurm is built to handle failover like this! That's
> >> >> >> why I'm so troubled by the behavior I'm seeing.
> >> >> >>
> >> >> >> The node configuration is relatively straightforward:
> >> >> >>
> >> >> >> NodeName=thenodename Procs=4 State=UNKNOWN
> >> >> >> #.. other nodes defined the same way
> >> >> >>
> >> >> >> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100 Shared=NO Default=YES
> >> >> >
> >> >> > Can you resolve all these hostnames on the controller node?
> >> >> >
> >> >> >
> >> >> >
> >> >> >> Could it be the State=UNKNOWN that is screwing things up? Are
> >> >> >> there any other configuration options that could produce this
> >> >> >> behavior? What seemed to happen was that a reconfigure command
> >> >> >> was sent to the controller, right before the error messages I sent
> >> >> >> previously:
> >> >> >>
> >> >> >> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> On Wed, Aug 10, 2011 at 9:03 AM,  <[email protected]> wrote:
> >> >> >> > There are many clusters running slurm where nodes go down daily
> >> >> >> > and you are the first person to report a problem. My best guess
> >> >> >> > is that your slurm.conf file is bad. What does your node
> >> >> >> > configuration line(s) look like in slurm.conf?
> >> >> >> >
> >> >> >> > Quoting Mike Schachter <[email protected]>:
> >> >> >> >
> >> >> >> >> So this morning a node went down in the middle of a bunch of
> >> >> >> >> jobs running, the slurm controller tried to reconfigure, and this
> >> >> >> >> was the only error message we got in the log file:
> >> >> >> >>
> >> >> >> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
> >> >> >> >> slurmctld: fatal: slurm_set_addr failure on thenodename
> >> >> >> >>
> >> >> >> >> The controller won't even restart if I set the State=DOWN for the
> >> >> >> >> node in /etc/slurm.conf. I have to manually remove the node from
> >> >> >> >> the configuration file in order for the controller to restart.
> >> >> >> >>
> >> >> >> >> This is a huge problem for us! We expected that nodes could go
> >> >> >> >> down with graceful failover. Any idea what's going on?
> >> >> >> >>
> >> >> >> >>  mike
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Tue, Aug 2, 2011 at 4:53 PM,  <[email protected]> wrote:
> >> >> >> >>>
> >> >> >> >>> Your log should say what is happening. If not, try logging on as
> >> >> >> >>> root and starting the daemon by hand with lots of debugging (-v's):
> >> >> >> >>> "slurmctld -Dvvvvv"
> >> >> >> >>>
> >> >> >> >>> Quoting Mike Schachter <[email protected]>:
> >> >> >> >>>
> >> >> >> >>>> Hi there,
> >> >> >> >>>>
> >> >> >> >>>> If we have a node down, and then restart the slurm controller,
> >> >> >> >>>> for some reason slurm won't start up! Is there some way to
> >> >> >> >>>> ameliorate this issue?
> >> >> >> >>>>
> >> >> >> >>>>  mike
> >> >> >> >>>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >> >
> >> >> >>
> >> >> >
> >> >>
> >> >
> >>
> >
>
