And just to make sure, I'm not the kind of person who sticks to the "we always did it that way…" mindset ;) Just trying to figure out why it suddenly breaks.
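
For what it's worth, each time it breaks I end up recovering the StandAlone node by hand, roughly along these lines (the resource name "r0" is only a placeholder for ours):

# on the node whose changes should be thrown away (the split-brain "victim")
drbdadm disconnect r0
drbdadm secondary r0
drbdadm connect --discard-my-data r0

# on the surviving node, if it also shows StandAlone
drbdadm connect r0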
-derek

--
Derek Wuelfrath
[email protected] :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)

> On Nov 15, 2017, at 15:30, Derek Wuelfrath <[email protected]> wrote:
>
> I agree. Thing is, we have had this kind of setup widely deployed for a while now,
> and we have never run into any issue.
> Not sure if something changed in the Corosync/Pacemaker code or in the way
> systemd resources are handled.
>
> As said, without a systemd resource, everything just works as it should… 100%
> of the time.
> As soon as a systemd resource comes in, it breaks.
>
> -derek
>
> --
> Derek Wuelfrath
> [email protected] :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
>
>> On Nov 14, 2017, at 23:03, Digimer <[email protected]> wrote:
>>
>> Quorum doesn't prevent split-brains, stonith (fencing) does.
>>
>> https://www.alteeve.com/w/The_2-Node_Myth
>>
>> There is no way to use quorum alone to avoid a potential split-brain. You
>> might be able to make it less likely with enough effort, but never prevent it.
>>
>> digimer
>>
>> On 2017-11-14 10:45 PM, Garima wrote:
>>> Hello All,
>>>
>>> A split-brain situation occurs when there is a drop in quorum, so status
>>> information is no longer exchanged between the two nodes of the cluster.
>>> This can be avoided if quorum is maintained between both nodes.
>>> I have checked the code. In my opinion these files need to be updated
>>> (quorum.py/stonith.py) to avoid the split-brain situation and keep the
>>> Active-Passive configuration.
>>>
>>> Regards,
>>> Garima
>>>
>>> From: Derek Wuelfrath [mailto:[email protected]]
>>> Sent: 13 November 2017 20:55
>>> To: Cluster Labs - All topics related to open-source clustering welcomed <[email protected]>
>>> Subject: Re: [ClusterLabs] Pacemaker responsible of DRBD and a systemd resource
>>>
>>> Hello Ken !
>>>
>>> Make sure that the systemd service is not enabled. If pacemaker is
>>> managing a service, systemd can't also be trying to start and stop it.
>>>
>>> It is not. I made sure of this in the first place :)
>>>
>>> Beyond that, the question is what log messages are there from around
>>> the time of the issue (on both nodes).
>>>
>>> Well, that's the thing. There are not many log messages telling what is
>>> actually happening. The 'systemd' resource is not even trying to start
>>> (nothing in either log for that resource). Here are the logs from my last
>>> attempt:
>>>
>>> Scenario:
>>> - Services were running on 'pancakeFence2'. DRBD was synced and connected.
>>> - I rebooted 'pancakeFence2'.
>>> - Services failed over to 'pancakeFence1'.
>>> - After 'pancakeFence2' came back, services were running just fine on
>>>   'pancakeFence1', but DRBD was in StandAlone due to a split-brain.
>>>
>>> Logs for pancakeFence1: https://pastebin.com/dVSGPP78
>>> Logs for pancakeFence2: https://pastebin.com/at8qPkHE
>>>
>>> It really looks like the status check mechanism of Corosync/Pacemaker for
>>> a systemd resource forces the resource to "start" and therefore starts the
>>> resources above it in the group (DRBD, in this case).
>>> This does not happen for a regular OCF resource (IPaddr2, for example).
>>>
>>> Cheers!
>>> -dw
>>>
>>> --
>>> Derek Wuelfrath
>>> [email protected] :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
>>>
>>> On Nov 10, 2017, at 11:39, Ken Gaillot <[email protected]> wrote:
>>>
>>> On Thu, 2017-11-09 at 20:27 -0500, Derek Wuelfrath wrote:
>>>
>>> Hello there,
>>>
>>> First post here, but I have been following for a while!
>>>
>>> Welcome!
>>>
>>> Here's my issue: we have been putting in place and running this type of
>>> cluster for a while and never really encountered this kind of problem.
>>>
>>> I recently set up a Corosync / Pacemaker / PCS cluster to manage DRBD
>>> along with different other resources. Some of these resources are
>>> systemd resources… this is the part where things are "breaking".
>>>
>>> A two-server cluster running only DRBD, or DRBD with an OCF IPaddr2
>>> resource (a cluster IP, for instance), works just fine. I can easily move
>>> from one node to the other without any issue.
>>> As soon as I add a systemd resource to the resource group, things break.
>>> Moving from one node to the other using standby mode works just fine, but
>>> as soon as a Corosync / Pacemaker restart involves polling of a systemd
>>> resource, it seems to try to start the whole resource group and therefore
>>> creates a split-brain of the DRBD resource.
>>>
>>> My first two suggestions would be:
>>>
>>> Make sure that the systemd service is not enabled. If pacemaker is
>>> managing a service, systemd can't also be trying to start and stop it.
>>>
>>> Fencing is the only way pacemaker can resolve split-brains and certain
>>> other situations, so that will help in the recovery.
>>>
>>> Beyond that, the question is what log messages are there from around
>>> the time of the issue (on both nodes).
>>>
>>> It is the best explanation / description of the situation that I can
>>> give. If it needs any clarification, examples, … I am more than open
>>> to share them.
>>>
>>> Any guidance would be appreciated :)
>>>
>>> Here's the output of a 'pcs config':
>>>
>>> https://pastebin.com/1TUvZ4X9
>>>
>>> Cheers!
>>> -dw
>>>
>>> --
>>> Derek Wuelfrath
>>> [email protected] :: +1.514.447.4918 (x110) :: +1.866.353.6153 (x110)
>>> Inverse inc. :: Leaders behind SOGo (www.sogo.nu), PacketFence (www.packetfence.org) and Fingerbank (www.fingerbank.org)
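>>>
>>> For example, with IPMI-capable nodes, something along these lines would
>>> enable fencing (the agent, addresses and credentials below are only
>>> placeholders; use whatever matches your hardware):
>>>
>>> # one fence device per node; fence_ipmilan is just an example agent
>>> pcs stonith create fence-pancake1 fence_ipmilan pcmk_host_list=pancakeFence1 ipaddr=10.0.0.11 login=admin passwd=secret lanplus=1 op monitor interval=60s
>>> pcs stonith create fence-pancake2 fence_ipmilan pcmk_host_list=pancakeFence2 ipaddr=10.0.0.12 login=admin passwd=secret lanplus=1 op monitor interval=60s
>>> # then turn stonith back on cluster-wide
>>> pcs property set stonith-enabled=true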
>>>
>>> --
>>> Ken Gaillot <[email protected]>
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com/w/
>> "I am, somehow, less interested in the weight and convolutions of Einstein's
>> brain than in the near certainty that people of equal talent have lived and
>> died in cotton fields and sweatshops." - Stephen Jay Gould
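
Also, for completeness, here is roughly what I check on both nodes to confirm
the unit really is out of systemd's hands and is only known to Pacemaker (the
unit name below is just a placeholder for our actual service):

# placeholder unit name; substitute the real service the cluster manages
systemctl is-enabled myservice.service   # expect "disabled"
systemctl is-active myservice.service    # expect "inactive" while the node is in standby
pcs status resources                     # the systemd:myservice resource should be started/stopped by Pacemaker only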
_______________________________________________
Users mailing list: [email protected]
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
