>>> Digimer <li...@alteeve.ca> wrote on 04.03.2021 at 06:38 in message
<41edb705-6b8a-2221-fc8b-a367aac98...@alteeve.ca>:
> On 2021-03-03 6:53 p.m., Eric Robinson wrote:
>> 
>>> -----Original Message-----
>>> From: Users <users-boun...@clusterlabs.org> On Behalf Of Ulrich Windl
>>> Sent: Wednesday, March 3, 2021 12:57 AM
>>> To: users@clusterlabs.org 
>>> Subject: [ClusterLabs] Antw: RE: Antw: [EXT] Re: "Error: unable to fence
>>> '001db02a'" but It got fenced anyway
>>>
>>>>>> Eric Robinson <eric.robin...@psmnv.com> wrote on 02.03.2021 at
>>>>>> 19:26 in message
>>>>>> <SA2PR03MB58847E37845FC6C92BC3007EFA999@SA2PR03MB5884.namprd03.prod.outlook.com>
>>>
>>>>>  -----Original Message-----
>>>>> From: Users <users-boun...@clusterlabs.org> On Behalf Of Digimer
>>>>> Sent: Monday, March 1, 2021 11:02 AM
>>>>> To: Cluster Labs - All topics related to open-source clustering
>>>>> welcomed <users@clusterlabs.org>; Ulrich Windl
>>>>> <ulrich.wi...@rz.uni-regensburg.de>
>>>>> Subject: Re: [ClusterLabs] Antw: [EXT] Re: "Error: unable to fence
>>>> '001db02a'"
>>> ...
>>>>>>> Cloud fencing usually requires a higher timeout (20s reported here).
>>>>>>>
>>>>>>> Microsoft seems to suggest the following setup:
>>>>>>>
>>>>>>> # pcs property set stonith-timeout=900
>>>>>>
>>>>>> But doesn't that mean the other node waits 15 minutes after stonith
>>>>>> until it performs the first post-stonith action?
>>>>>
>>>>> No, it means that if there is no reply by then, the fence has failed.
>>>>> If the fence happens sooner, and the caller is told this, recovery
>>>>> begins very shortly after.
>>>
>>> How would the fencing be confirmed? I don't know.
>>>
>>>
>>>>>
>>>>
>>>> Interesting. Since users often report application failure within 1-3
>>>> minutes and engineers may begin investigating immediately, a technician
>>>> could end up connecting to a cluster node after the stonith command was
>>>> called, and could conceivably bring a failed node back up manually, only
>>>> to have Azure finally get around to shooting it in the head. I don't
>>>> suppose there's a way to abort/cancel a STONITH operation that is in
>>>> progress?
>>>
>>> I think you have to decide: let the cluster handle the problem, or let
>>> the admin handle the problem, but preferably not both.
>>> I also think you cannot cancel a STONITH; you can only confirm it.
>>>
>>> Regards,
>>> Ulrich
>>>
>> 
>> Standing by and letting the cluster handle the problem is a hard pill to
>> swallow when a technician could resolve things and bring services back up
>> sooner, but I get your point.
> 
> In all my years, I've learned to trust carefully reviewed code to do the
> right thing over humans. Outside HA specialists, most people set up HA
> and forget it, often for months or even years. The idea that they would
> remember what to do, accurately, while there is also a major outage is,
> to me, a much harder pill to swallow.
> 
> A well-tested HA cluster that is designed properly will have a far, far
> higher chance of quickly and efficiently recovering services during an
> outage. This is extra true if the problem arises after beer o'clock on a
> Friday evening of the first day of vacation.
> 
> Be careful not to confuse the effort needed to do the initial, proper
> build and testing with the reliability of the system. The more thorough
> you are during building, the more reliable your system over time.

I agree. Maybe some of those actions are "symptom fixing" when actual "cause
fixing" is needed.
The cluster, too, can only fix symptoms; the admin is expected to find and
fix the foreseeable causes of errors.
That's how HA really works: configure, test, and repeat until things work as
expected.
In between, the cluster acts as instructed, for better or worse...
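
On the stonith-timeout question quoted earlier in the thread, here is a
minimal sketch of the difference between an upper bound and a fixed delay.
It does not call Pacemaker or any fence agent; `wait_for_fence` and
`simulate` are hypothetical names, and a timer thread stands in for the
fence device reporting back. It only illustrates the semantics Digimer
describes: the caller returns as soon as the fence is confirmed, and the
timeout matters only when no confirmation arrives.

```python
import threading
import time
from typing import Tuple


def wait_for_fence(confirmed: threading.Event, timeout_s: float) -> Tuple[bool, float]:
    """Block until the fence is confirmed or timeout_s elapses.

    Returns (was_confirmed, seconds_waited). The timeout is a ceiling,
    not a delay: a confirmation ends the wait immediately.
    """
    start = time.monotonic()
    was_confirmed = confirmed.wait(timeout=timeout_s)
    return was_confirmed, time.monotonic() - start


def simulate(fence_delay_s: float, timeout_s: float) -> Tuple[bool, float]:
    """Hypothetical fence device that confirms after fence_delay_s seconds."""
    confirmed = threading.Event()
    device = threading.Timer(fence_delay_s, confirmed.set)
    device.daemon = True
    device.start()
    try:
        return wait_for_fence(confirmed, timeout_s)
    finally:
        device.cancel()  # no effect if the timer has already fired


if __name__ == "__main__":
    # Fence confirmed quickly: recovery can start long before the timeout.
    print(simulate(fence_delay_s=0.05, timeout_s=2.0))
    # No confirmation within the timeout: the fence is treated as failed.
    print(simulate(fence_delay_s=1.0, timeout_s=0.1))
```

With the values scaled up, a 20-second cloud fence under stonith-timeout=900
would let recovery begin after roughly 20 seconds, not 15 minutes.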

Regards,
Ulrich


> 
> -- 
> Digimer
> Papers and Projects: https://alteeve.com/w/ 
> "I am, somehow, less interested in the weight and convolutions of
> Einstein’s brain than in the near certainty that people of equal talent
> have lived and died in cotton fields and sweatshops." - Stephen Jay Gould
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users 
> 
> ClusterLabs home: https://www.clusterlabs.org/ 


