No, once the node "thenodename" goes down, there is no more
host resolution; it can't be pinged or SSH'd to.
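For anyone hitting the same thing: the fatal error comes from a plain hostname lookup, which can be reproduced outside of slurm. A minimal sketch in Python (the `.invalid` suffix is a reserved TLD that never resolves, standing in for a node whose DNS record has disappeared; this mimics, not reuses, slurmctld's internal lookup):

```python
import socket

# Resolving a hostname that has no DNS record fails the same way
# slurmctld's address lookup does ("Unable to resolve ...").
try:
    socket.getaddrinfo("thenodename.invalid", None)
    print("resolved (unexpected)")
except socket.gaierror as err:
    print("resolution failed:", err)
```

If the name for a down node only lives in a dynamic DNS that drops it, the controller's lookup will fail exactly like this.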

  mike


On Wed, Aug 10, 2011 at 9:21 AM, Mark A. Grondona <[email protected]> wrote:
> On Wed, 10 Aug 2011 09:13:17 -0700, Mike Schachter <[email protected]> 
> wrote:
>> I'm sure that slurm is built to handle failover like this! That's
>> why I'm so troubled by the behavior I'm seeing.
>>
>> The node configuration is relatively straightforward:
>>
>> NodeName=thenodename Procs=4 State=UNKNOWN
>> #.. other nodes defined the same way
>>
>> PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100
>> Shared=NO Default=YES
>
> Can you resolve all these hostnames on the controller node?
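A quick way to answer that question on the controller; the node names below are the ones from the posted slurm.conf, and this is just a sketch of the check, not anything slurm itself runs:

```python
import socket

def resolvable(name):
    """Return True if `name` resolves on this host -- the same kind of
    lookup slurmctld performs for each NodeName at startup."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

# Node names taken from the posted config; they will only resolve
# on that cluster's network.
for name in ["nodename1", "nodename2", "thenodename"]:
    print(name, "->", "ok" if resolvable(name) else "UNRESOLVABLE")
```

Any name that prints UNRESOLVABLE on the controller would trip the same "Unable to resolve" fatal in slurmctld.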
>
>
>
>> Could it be the State=UNKNOWN that is screwing things up? Are
>> there any other configuration options that could produce this
>> behavior? What seemed to happen was that a reconfigure command
>> was sent to the controller, right before the error messages I sent
>> previously:
>>
>> [2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
>>
>>
>>
>> On Wed, Aug 10, 2011 at 9:03 AM,  <[email protected]> wrote:
>> > There are many clusters running slurm where nodes go down daily and you are
>> > the first person to report a problem. My best guess is that your slurm.conf
>> > file is bad. What do your node configuration line(s) look like in
>> > slurm.conf?
>> >
>> > Quoting Mike Schachter <[email protected]>:
>> >
>> >> So this morning a node went down in the middle of a bunch of
>> >> jobs running, the slurm controller tried to reconfigure, and this
>> >> was the only error message we got in the log file:
>> >>
>> >> slurmctld: error: Unable to resolve "thenodename": Unknown host
>> >> slurmctld: fatal: slurm_set_addr failure on thenodename
>> >>
>> >> The controller won't even restart if I set the State=DOWN for the node
>> >> in /etc/slurm.conf. I have to manually remove the node from the
>> >> configuration file in order for the controller to restart.
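One possible workaround (an assumption on my part, not something the thread confirms): give the controller a resolution path for each node that doesn't depend on live DNS, e.g. static /etc/hosts entries, so a node dropping out of DNS can't take down name lookups. The addresses below are placeholders:

```
# /etc/hosts on the controller -- static entries so lookups keep
# working even while a node is down (addresses are made up)
192.168.1.11  nodename1
192.168.1.12  nodename2
192.168.1.13  thenodename
```

If memory serves, slurm.conf also accepts a NodeAddr= field on the NodeName line to decouple the slurm node name from whatever DNS returns; worth checking the slurm.conf man page.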
>> >>
>> >> This is a huge problem for us! We expected that nodes could go
>> >> down with graceful failover. Any idea what's going on?
>> >>
>> >>  mike
>> >>
>> >>
>> >> On Tue, Aug 2, 2011 at 4:53 PM,  <[email protected]> wrote:
>> >>>
>> >>> Your log should say what is happening. If not, try logging on as root and
>> >>> starting the daemon by hand with lots of debugging (-v's):
>> >>> "slurmctld -Dvvvvv"
>> >>>
>> >>> Quoting Mike Schachter <[email protected]>:
>> >>>
>> >>>> Hi there,
>> >>>>
>> >>>> If we have a node down, and then restart the slurm controller,
>> >>>> for some reason slurm won't start up! Is there some way to
>> >>>> ameliorate this issue?
>> >>>>
>> >>>>  mike
>> >>>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >>
>> >
>> >
>> >
>> >
>>
>
