On 2021-02-26 12:23 p.m., Eric Robinson wrote:
>> -----Original Message-----
>> From: Digimer <[email protected]>
>> Sent: Friday, February 26, 2021 10:35 AM
>> To: Cluster Labs - All topics related to open-source clustering welcomed
>> <[email protected]>; Eric Robinson <[email protected]>
>> Subject: Re: [ClusterLabs] Our 2-Node Cluster with a Separate Qdevice Went
>> Down Anyway?
>>
>> On 2021-02-26 11:19 a.m., Eric Robinson wrote:
>>> At 5:16 am Pacific time Monday, one of our cluster nodes failed and
>>> its mysql services went down. The cluster did not automatically recover.
>>>
>>> We're trying to figure out:
>>>
>>> 1. Why did it fail?
>>> 2. Why did it not automatically recover?
>>>
>>> The cluster did not recover until we manually executed...
>>>
>>> # pcs resource cleanup p_mysql_622
>>>
>>> OS: CentOS Linux release 7.5.1804 (Core)
>>>
>>> Cluster version:
>>>
>>> corosync.x86_64 2.4.5-4.el7 @base
>>> corosync-qdevice.x86_64 2.4.5-4.el7 @base
>>> pacemaker.x86_64 1.1.21-4.el7 @base
>>>
>>> Two nodes: 001db01a, 001db01b
>>>
>>> The following log snippet is from node 001db01a:
>>>
>>> [root@001db01a cluster]# grep "Feb 22 05:1[67]" corosync.log-20210223
>>
>> <snip>
>>
>>> Feb 22 05:16:30 [91682] 001db01a pengine: warning: cluster_status: Fencing and resource management disabled due to lack of quorum
>>
>> Seems like there was no quorum from this node's perspective, so it won't do
>> anything. What does the other node's logs say?
>>
>
> The logs from the other node are at the bottom of the original email.
>
>> What is the cluster configuration? Do you have stonith (fencing) configured?
>
> 2-node with a separate qdevice. No fencing.
>
>> Quorum is a useful tool when things are working properly, but it doesn't help
>> when things enter an undefined / unexpected state.
>> When that happens, stonith saves you. So said another way, you must have
>> stonith for a stable cluster, quorum is optional.
>>
>
> In this case, if fencing was enabled, which node would have fenced the other?
> Would they have gotten into a STONITH war?

You can set a preference for which node wins by assigning a fence delay to
your preferred node. Say your services were running on node 1: you put the
delay on the fence method that shoots node 1. In a case like this, node 2
looks up how to fence node 1, sees the delay, and waits. Node 1 looks up how
to fence node 2, sees no delay, and fences immediately. If, however, node 1
was actually dead, then after the delay (typically 15 seconds), node 2
proceeds with the fence and takes over the lost services.

Without fencing / stonith, what happens during a failure is undetermined.
All production clusters really must have fencing.

If you also have quorum, then the delay doesn't matter and the node that
maintains contact with the quorum node wins. However, if something breaks
all cluster communications (corosync, specifically), both nodes lose quorum
and neither recovers. For this reason, I never bother with quorum (I set the
two-node flag) and just rely on fencing. It takes avoidable complexity out of
the system.

> More importantly, why did the failure of resource p_mysql_622 keep the whole
> cluster from recovering? As soon as I did 'pcs resource cleanup p_mysql_622'
> all the other resources recovered, but none of them are dependent on that
> resource.

--
Digimer
Papers and Projects: https://alteeve.com/w/
"I am, somehow, less interested in the weight and convolutions of Einstein’s
brain than in the near certainty that people of equal talent have lived and
died in cotton fields and sweatshops." - Stephen Jay Gould
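
The fence-delay reasoning in the reply above can be sketched as a toy simulation. This is only an illustration of the race, not how Pacemaker or any fence agent is implemented; the function name `fence_race`, the `delay_on_target` mapping, and the node names `node1`/`node2` are all hypothetical, and the 15-second value is simply the "typical" delay mentioned in the reply.

```python
from typing import Dict, List, Tuple


def fence_race(delay_on_target: Dict[str, float],
               alive: Dict[str, bool]) -> List[Tuple[str, str, float]]:
    """Toy model of a two-node fence race (illustrative assumption only).

    delay_on_target maps a node name to the delay, in seconds, configured on
    the fence method that shoots that node. alive says which nodes are still
    running and therefore able to fence their peer.

    Returns the fence actions that actually fire as (shooter, target, time).
    """
    nodes = sorted(alive)
    if len(nodes) != 2:
        raise ValueError("this sketch only models a two-node cluster")
    a, b = nodes

    # Every surviving node tries to shoot its peer, but first waits for the
    # delay attached to the fence method that targets that peer.
    attempts = []
    for shooter, target in ((a, b), (b, a)):
        if alive[shooter]:
            wait = delay_on_target.get(target, 0.0)
            attempts.append((wait, shooter, target))

    if not attempts:
        return []

    # Only the earliest attempt(s) fire: a node that is still waiting has
    # already been fenced by the time its own delay would have expired.
    first = min(wait for wait, _, _ in attempts)
    return [(shooter, target, wait)
            for wait, shooter, target in attempts if wait == first]


if __name__ == "__main__":
    delays = {"node1": 15.0}  # node1 is the preferred node
    # Comms break but both nodes are alive: node1 shoots first and wins.
    print(fence_race(delays, {"node1": True, "node2": True}))
    # node1 really is dead: node2 waits out the delay, then fences it.
    print(fence_race(delays, {"node1": False, "node2": True}))
    # No delay configured: both fire at once, so each node shoots the other.
    print(fence_race({}, {"node1": True, "node2": True}))
```

The third case is the situation the delay is meant to avoid: with no preference set, nothing in this model stops both nodes from fencing each other at the same moment.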
