I'm sure that slurm is built to handle failover like this! That's
why I'm so troubled by the behavior I'm seeing.

The node configuration is relatively straightforward:

NodeName=thenodename Procs=4 State=UNKNOWN
#.. other nodes defined the same way

PartitionName=all Nodes=nodename1,nodename2,thenodename Priority=100 Shared=NO Default=YES

Could it be the State=UNKNOWN that is screwing things up? Are
there any other configuration options that could produce this
behavior? From the log, it looks like a reconfigure signal reached
the controller right before the error messages I sent previously:

[2011-08-10T06:47:52] Reconfigure signal (SIGHUP) received
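
Since the fatal error was a name lookup failure ("Unable to resolve"),
one thing that can be checked independently of slurm is whether every
NodeName in the config file resolves from the controller host. Below is
a minimal sketch of such a check, not anything that ships with slurm.
The helper names (node_names, unresolvable) are made up for
illustration, and it assumes plain node names only: bracketed ranges
like node[1-4] and the special DEFAULT entry are skipped rather than
expanded.

```python
import re
import socket


def node_names(conf_text):
    """Return plain NodeName values found in slurm.conf-style text.

    Assumption: simple names only. Bracketed host ranges and the
    special DEFAULT entry are skipped, not expanded.
    """
    names = []
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        match = re.match(r"NodeName=(\S+)", line, re.IGNORECASE)
        if not match:
            continue
        value = match.group(1)
        if "[" in value or value.upper() == "DEFAULT":
            continue
        names.extend(value.split(","))
    return names


def unresolvable(names, resolve=socket.gethostbyname):
    """Return the names for which the resolver raises a lookup error."""
    bad = []
    for name in names:
        try:
            resolve(name)
        except OSError:  # socket.gaierror and socket.herror are subclasses
            bad.append(name)
    return bad


if __name__ == "__main__":
    # Path taken from this thread; adjust if your slurm.conf lives elsewhere.
    with open("/etc/slurm.conf") as handle:
        for name in unresolvable(node_names(handle.read())):
            print("cannot resolve:", name)
```

If a name shows up here while its node is down, that would suggest the
lookup depends on something the down node takes with it (for example
DNS or a hosts entry), rather than on the State= setting.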



On Wed, Aug 10, 2011 at 9:03 AM,  <[email protected]> wrote:
> There are many clusters running slurm where nodes go down daily and you are
> the first person to report a problem. My best guess is that your slurm.conf
> file is bad. What does your node configuration line(s) look like in
> slurm.conf?
>
> Quoting Mike Schachter <[email protected]>:
>
>> So this morning a node went down in the middle of a bunch of
>> jobs running, the slurm controller tried to reconfigure, and this
>> was the only error message we got in the log file:
>>
>> slurmctld: error: Unable to resolve "thenodename": Unknown host
>> slurmctld: fatal: slurm_set_addr failure on thenodename
>>
>> The controller won't even restart if I set the State=DOWN for the node
>> in /etc/slurm.conf. I have to manually remove the node from configuration
>> file in order for the controller to restart.
>>
>> This is a huge problem for us! We expected that nodes could go
>> down with graceful failover. Any idea what's going on?
>>
>>  mike
>>
>>
>> On Tue, Aug 2, 2011 at 4:53 PM,  <[email protected]> wrote:
>>>
>>> Your log should say what is happening. If not, try logging on as root and
>>> starting the daemon by hand with lots of debugging (-v's):
>>> "slurmctld -Dvvvvv"
>>>
>>> Quoting Mike Schachter <[email protected]>:
>>>
>>>> Hi there,
>>>>
>>>> If we have a node down, and then restart the slurm controller,
>>>> for some reason slurm won't start up! Is there some way to
>>>> ameliorate this issue?
>>>>
>>>>  mike
>>>>
>>>
>>>
>>>
>>>
>>
>>
>
>
>
>
