And just to make sure, I’m not the kind of person who sticks to the “we always 
did it that way…” ;)
I’m just trying to figure out why it suddenly breaks.

-derek

--
Derek Wuelfrath
[email protected] :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)

> On Nov 15, 2017, at 15:30, Derek Wuelfrath <[email protected]> wrote:
> 
> I agree. The thing is, we have had this kind of setup widely deployed for a 
> while now and have never run into any issue.
> I’m not sure whether something changed in the Corosync/Pacemaker code or in 
> the way it deals with systemd resources.
> 
> As I said, without a systemd resource, everything just works as it should… 100% 
> of the time.
> As soon as a systemd resource comes in, it breaks.
> 
> -derek
> 
> --
> Derek Wuelfrath
> [email protected] :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
> 
>> On Nov 14, 2017, at 23:03, Digimer <[email protected]> wrote:
>> 
>> Quorum doesn't prevent split-brains, stonith (fencing) does. 
>> 
>> https://www.alteeve.com/w/The_2-Node_Myth
>> 
>> There is no way to avoid a potential split-brain using quorum alone. You 
>> might be able to make it less likely with enough effort, but never prevent 
>> it.
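>> 
>> As a rough sketch of what enabling fencing looks like with pcs (the agent, 
>> addresses and credentials here are placeholders assuming IPMI-based fence 
>> devices; use whatever matches your hardware):
>> 
>>   pcs stonith create fence-node1 fence_ipmilan pcmk_host_list="pancakeFence1" \
>>       ipaddr="10.0.0.1" login="admin" passwd="secret" op monitor interval=60s
>>   pcs stonith create fence-node2 fence_ipmilan pcmk_host_list="pancakeFence2" \
>>       ipaddr="10.0.0.2" login="admin" passwd="secret" op monitor interval=60s
>>   pcs property set stonith-enabled=true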
>> 
>> digimer
>> 
>> On 2017-11-14 10:45 PM, Garima wrote:
>>> Hello All,
>>>  
>>> A split-brain situation occurs when there is a drop in quorum: status 
>>> information is then no longer exchanged between the two nodes of the 
>>> cluster.
>>> This can be avoided if quorum communication is maintained between the two nodes.
>>> I have checked the code. In my opinion these files need to be updated 
>>> (quorum.py/stonith.py) to avoid the split-brain situation and maintain the 
>>> Active-Passive configuration.
>>>  
>>> Regards,
>>> Garima
>>>  
>>> From: Derek Wuelfrath [mailto:[email protected]]
>>> Sent: 13 November 2017 20:55
>>> To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
>>> Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource
>>>  
>>> Hello Ken !
>>>  
>>> Make sure that the systemd service is not enabled. If pacemaker is
>>> managing a service, systemd can't also be trying to start and stop it.
>>>  
>>> It is not. I made sure of this in the first place :)
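>>> (Checked with "systemctl is-enabled <name-of-the-managed-service>", which 
>>> reports it as disabled; the service name here is a placeholder for the unit 
>>> defined as the systemd resource in the pcs config linked further down.)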
>>>  
>>> Beyond that, the question is what log messages are there from around
>>> the time of the issue (on both nodes).
>>>  
>>> Well, that’s the thing. There aren’t many log messages telling what is 
>>> actually happening. The ‘systemd’ resource is not even trying to start 
>>> (nothing in either log for that resource). Here are the logs from my last 
>>> attempt:
>>> Scenario:
>>> - Services were running on ‘pancakeFence2’. DRBD was synced and connected
>>> - I rebooted ‘pancakeFence2’. Services failed over to ‘pancakeFence1’
>>> - After ‘pancakeFence2’ came back, services were running just fine on 
>>> ‘pancakeFence1’ but DRBD was StandAlone due to a split-brain
>>>  
>>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
>>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE
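>>> 
>>> (Recovering manually from that split-brain is possible with something along 
>>> these lines, "r0" being a placeholder for the actual DRBD resource name. On 
>>> the node whose changes should be discarded:
>>>   drbdadm disconnect r0
>>>   drbdadm secondary r0
>>>   drbdadm connect --discard-my-data r0
>>> and on the surviving node, if it is StandAlone as well:
>>>   drbdadm connect r0
>>> The goal is obviously not to end up there in the first place.)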
>>>  
>>> It really looks like the status-check mechanism of Corosync/Pacemaker for a 
>>> systemd resource forces the resource to “start” and therefore starts the 
>>> resources above it in the group (DRBD in this instance).
>>> This does not happen with a regular OCF resource (IPaddr2, for example).
>>> 
>>> Cheers!
>>> -dw
>>>  
>>> --
>>> Derek Wuelfrath
>>> [email protected] :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
>>> 
>>> 
>>> On Nov 10, 2017, at 11:39, Ken Gaillot <[email protected]> wrote:
>>>  
>>> On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
>>> 
>>> Hello there,
>>> 
>>> First post here, but I’ve been following for a while!
>>> 
>>> Welcome!
>>> 
>>> 
>>> 
>>> Here’s my issue:
>>> we have been putting in place and running this type of cluster for a
>>> while and have never really encountered this kind of problem.
>>> 
>>> I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
>>> along with various other resources. Some of these resources are
>>> systemd resources… this is the part where things are “breaking”.
>>> 
>>> A two-server cluster running only DRBD, or DRBD with an OCF
>>> IPaddr2 resource (a cluster IP in this instance), works just fine. I can
>>> easily move from one node to the other without any issue.
>>> As soon as I add a systemd resource to the resource group, things
>>> break. Moving from one node to the other using standby mode works
>>> just fine, but as soon as a Corosync / Pacemaker restart involves
>>> polling of a systemd resource, it seems to try to start
>>> the whole resource group and therefore creates a split-brain of the
>>> DRBD resource.
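>>> 
>>> The layout is roughly along these lines (resource names, the address and the 
>>> systemd unit are placeholders here; the real configuration is in the 
>>> ‘pcs config’ output linked below):
>>> 
>>>   pcs resource create drbd-data ocf:linbit:drbd drbd_resource=r0 \
>>>       op monitor interval=29s role=Master op monitor interval=31s role=Slave
>>>   pcs resource master drbd-data-master drbd-data \
>>>       master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
>>>   pcs resource create cluster-ip ocf:heartbeat:IPaddr2 ip=192.168.1.10 cidr_netmask=24
>>>   pcs resource create some-service systemd:some-daemon
>>>   pcs resource group add service-group cluster-ip some-service
>>>   pcs constraint colocation add service-group with drbd-data-master INFINITY with-rsc-role=Master
>>>   pcs constraint order promote drbd-data-master then start service-group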
>>> 
>>> My first two suggestions would be:
>>> 
>>> Make sure that the systemd service is not enabled. If pacemaker is
>>> managing a service, systemd can't also be trying to start and stop it.
>>> 
>>> Fencing is the only way pacemaker can resolve split-brains and certain
>>> other situations, so that will help in the recovery.
>>> 
>>> Beyond that, the question is what log messages are there from around
>>> the time of the issue (on both nodes).
>>> 
>>> 
>>> 
>>> 
>>> That is the best explanation / description of the situation that I can
>>> give. If it needs any clarification, examples, … I am more than happy
>>> to share them.
>>> 
>>> Any guidance would be appreciated :)
>>> 
>>> Here’s the output of a ‘pcs config’
>>> 
>>> https://pastebin.com/1TUvZ4X9
>>> 
>>> Cheers!
>>> -dw
>>> 
>>> --
>>> Derek Wuelfrath
>>> [email protected] :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
>>> -- 
>>> Ken Gaillot <[email protected]>
>>> 
>> 
>> -- 
>> Digimer
>> Papers and Projects: https://alteeve.com/w/
>> "I am, somehow, less interested in the weight and convolutions of Einstein’s 
>> brain than in the near certainty that people of equal talent have lived and 
>> died in cotton fields and sweatshops." - Stephen Jay Gould

_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
