Re: [ClusterLabs] Antw: Antw: [EXT] Stopping a server failed and fenced, despite disabling stop timeout

Digimer Mon, 18 Jan 2021 11:04:07 -0800

On 2021-01-18 3:31 a.m., Ulrich Windl wrote:
>>>> "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de> schrieb am 18.01.2021 um
> 09:28 in Nachricht <60054697020000a10003e...@gwsmtp.uni-regensburg.de>:
>>>>> Digimer <li...@alteeve.ca> schrieb am 18.01.2021 um 03:11 in Nachricht
>> <816a4d1e-a92d-2a4c-b1a0-cf4353e3f...@alteeve.ca>:
>>> Hi all,
>>>
>>>   Mind the slew of questions, well into testing now and finding lots of
>>> issues. This one is two questions... :)
>>>
>>>   I set a server to be unmanaged in pacemaker while the server was
>>> running. Then I tried to remove the resource, and it refused, saying it
>>> couldn't stop it, and to use '--force'. So I did, and the node got
>>> fenced. Now, the resource was set up with:
>>
>> My guess is you shouldn't do it that way: Why not stop the resource,
>> unconfigure it in the cluster, then start it manually?
>>
>>>
>>> pcs resource create srv07-el6 ocf:alteeve:server name="srv07-el6" \
>>>  meta allow-migrate="true" target-role="started" \
>>>  op monitor interval="60" start timeout="INFINITY" \
>>>  on-fail="block" stop timeout="INFINITY" on-fail="block" \
>>>  migrate_to timeout="INFINITY"
>>>
>>>   I would have expected the 'stop timeout="INFINITY" on-fail="block"' to
>>> prevent fencing if the server failed to stop (question 1) and that if a
>>> resource was unmanaged, that the resource wouldn't even try to stop
>>> (question 2).
>>>
>>>   Can someone help me understand what happened here?
>>
>> Fencing reason was " srv01-test_stop_0 process (PID 113779) timed out".
>>
>> Did have a failure before your actions? The logs seem to indicate so:
> 
> Sorry: "Did you have a failure before your actions?"


I had, yes, but I cleared it.

I'm intentionally doing "weird things" to see how the system reacts, and
when things go bad (like this), what can be done to make the system more
resilient.

If I've learned anything in 10 years of HA, it's that people will do all
the things you think they shouldn't do. So I'm trying to do them before
they do and learn how to mitigate as much as possible.
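
For reference, a minimal sketch of the order Ulrich suggests (stop the
resource, then unconfigure it), preceded by the crm_mon check he mentions.
The function name remove_resource_cleanly and the DRY_RUN variable are made
up for illustration. It assumes the resource is still managed, and it only
prints the commands unless DRY_RUN=0 is set:

```shell
#!/bin/sh
# Hypothetical helper: print (or, with DRY_RUN=0, execute) a command.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: stop and unconfigure a resource in the suggested order.
# Assumes the resource is managed, so the cluster is allowed to stop it.
remove_resource_cleanly() {
    res="$1"
    run crm_mon -1Arfj              # check for leftover failures first
    run pcs resource disable "$res" # ask the cluster to stop the resource
    run pcs resource delete "$res"  # then remove it from the configuration
    # Starting the server by hand afterwards is left to the operator.
}

remove_resource_cleanly srv07-el6
```

In dry-run mode this only echoes the three commands, so the sequence can be
reviewed before running anything against a live cluster.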

>> "Clearing failure of srv01-test on el8-a01n02 because resource  parameters
>> have changed"
>>
>> Having the cluster in a clean state before configuring it is highly desirable
>> IMHO. I use this command frequently to check: "crm_mon -1Arfj"
>>
>> The logs should help to explain!
>>
>> Regards,
>> Ulrich
> 
> 
> _______________________________________________
> Manage your subscription:
> https://lists.clusterlabs.org/mailman/listinfo/users
> 
> ClusterLabs home: https://www.clusterlabs.org/
> 


-- 
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of
Einstein’s brain than in the near certainty that people of equal talent
have lived and died in cotton fields and sweatshops." - Stephen Jay Gould