On Wed, Aug 10, 2011 at 10:34 AM, Mark A. Grondona <[email protected]> wrote:

On Wed, 10 Aug 2011 10:19:56 -0700, Mike Schachter <[email protected]> wrote:

A note from the guy who set up DNS: "the dynamic dns updates about
every 5 min, so a dead server will have its dns entry removed"

Also, looks like nothing is being returned by gethostbyname:

me@slurmctrlr:~$ perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
Bad arg length for Socket::inet_ntoa, length is 0, should be 4 at -e line 1.
me@slurmctrlr:~$ perl -MSocket -e 'print (scalar gethostbyname("thenodename"), "\ntest line\n")'
test line

Node hostnames need to be resolvable whether they are up or down. I'm
not sure dynamic dns is going to work very well with SLURM either. I
would put all your hosts into /etc/hosts, and don't let their IP
addresses change.

mark

On Wed, Aug 10, 2011 at 10:02 AM, Mark A. Grondona <[email protected]> wrote:
> On Wed, 10 Aug 2011 09:49:50 -0700, Mike Schachter <[email protected]> wrote:
>> me@slurmctrlr:~$ nslookup thenodename
>> Server:  10.0.0.1
>> Address: 10.0.0.1#53
>>
>> ** server can't find thenodename: NXDOMAIN
>
> Sorry, that wasn't quite sufficient, except now we know "thenodename"
> isn't in DNS. Does this perl script print the IP for thenodename:
>
> perl -MSocket -e 'print (inet_ntoa(scalar gethostbyname("thenodename")),"\n")'
>
> mark
>
>> On Wed, Aug 10, 2011 at 9:40 AM, Mark A. Grondona <[email protected]> wrote:
>> > On Wed, 10 Aug 2011 09:28:41 -0700, Mike Schachter <[email protected]> wrote:
>> >> No, once the node "thenodename" goes down, there is no more
>> >> host resolution, it can't be pinged or ssh'ed to.
>> >>
>> >> mike
>> >
>> > I meant does
>> >
>> >   nslookup thenodename
>> >
>> > work?
>> >
>> > mark
>> >
>> >> On Wed, Aug 10, 2011 at 9:21 AM, Mark A. Grondona <[email protected]> wrote:
>> >> > On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter <[email protected]> wrote:
>> >> >> I'm sure that slurm is built to handle failover like this! That's
>> >> >> why I'm so troubled by the behavior I'm seeing.
>> >> >>
>> >> >> The node configuration is relatively straightforward:
>> >> >>
>> >> >> NodeName=thenodename Procs=4 State=UNKNOWN
>> >> >> # .. other nodes defined the same way
>> >> >>
>> >> >> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100 Shared=NO Default=YES
>> >> >
>> >> > Can you resolve all these hostnames on the controller node?
>> >> >
>> >> >> Could it be the State=UNKNOWN that is screwing things up? Are
>> >> >> there any other configuration options that could produce this
>> >> >> behavior? What seemed to happen was that a reconfigure command
>> >> >> was sent to the controller, right before the error messages I sent
>> >> >> previously:
>> >> >>
>> >> >> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
>> >> >>
>> >> >> On Wed, Aug 10, 2011 at 9:03 AM, <[email protected]> wrote:
>> >> >> > There are many clusters running slurm where nodes go down daily and you are
>> >> >> > the first person to report a problem. My best guess is that your slurm.conf
>> >> >> > file is bad. What does your node configuration line(s) look like in
>> >> >> > slurm.conf?
>> >> >> >
>> >> >> > Quoting Mike Schachter <[email protected]>:
>> >> >> >
>> >> >> >> So this morning a node went down in the middle of a bunch of
>> >> >> >> jobs running, the slurm controller tried to reconfigure, and this
>> >> >> >> was the only error message we got in the log file:
>> >> >> >>
>> >> >> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
>> >> >> >> slurmctld: fatal: slurm_set_addr failure on thenodename
>> >> >> >>
>> >> >> >> The controller won't even restart if I set State=DOWN for the node
>> >> >> >> in /etc/slurm.conf. I have to manually remove the node from the
>> >> >> >> configuration file in order for the controller to restart.
>> >> >> >>
>> >> >> >> This is a huge problem for us! We expected that nodes could go
>> >> >> >> down with graceful failover. Any idea what's going on?
>> >> >> >>
>> >> >> >> mike
>> >> >> >>
>> >> >> >> On Tue, Aug 2, 2011 at 4:53 PM, <[email protected]> wrote:
>> >> >> >>> Your log should say what is happening. If not, try logging on as root and
>> >> >> >>> starting the daemon by hand with lots of debugging (-v's):
>> >> >> >>> "slurmctld -Dvvvvv"
>> >> >> >>>
>> >> >> >>> Quoting Mike Schachter <[email protected]>:
>> >> >> >>>
>> >> >> >>>> Hi there,
>> >> >> >>>>
>> >> >> >>>> If we have a node down, and then restart the slurm controller,
>> >> >> >>>> for some reason slurm won't start up! Is there some way to
>> >> >> >>>> ameliorate this issue?
>> >> >> >>>>
>> >> >> >>>> mike
I concur with Mark, _but_ if putting all of your hosts into
/etc/hosts is not possible, SLURM has a node state of "Future" that
might be used for nodes that will appear later. It is intended more
for adding nodes from the cloud and will take some effort to manage
the node state in SLURM, but it could be used here.
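As a rough illustration only (this sketch reuses the example node names
from the thread and is not from anyone's actual config), a "Future"
node definition in slurm.conf might look like:

# Hypothetical slurm.conf fragment, based on the node lines quoted above.
# State=FUTURE tells slurmctld the node is defined but not yet available,
# so the controller should start even though the hostname may not resolve yet.
NodeName=thenodename Procs=4 State=FUTURE
PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100 Shared=NO Default=YES

When the node actually comes online (and its hostname resolves), its
state would then need to be updated administratively, e.g. with
scontrol; check the slurm.conf and scontrol man pages for your SLURM
version, since "Future"/cloud support has varied across releases.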
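Whichever route you take, the resolution check Mark did with the Perl
one-liner can be scripted for every node before (re)starting slurmctld.
A minimal sketch in Python (the hostnames here are the thread's example
names, not real hosts):

```python
import socket

def resolvable(hostname):
    """Return True if hostname resolves to an IPv4 address, like
    Perl's gethostbyname check in the thread."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False

# Check every node listed in the partition before starting the controller.
for node in ["nodename1", "nodename2", "thenodename"]:
    status = "ok" if resolvable(node) else "UNRESOLVABLE"
    print(node, status)
```

Running something like this from the controller host would have flagged
"thenodename" as unresolvable before slurmctld hit its fatal
slurm_set_addr error.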
