On Mon, Sep 4, 2023 at 1:18 PM Klaus Wenninger <kwenn...@redhat.com> wrote:
> On Mon, Sep 4, 2023 at 12:45 PM David Dolan <daithido...@gmail.com> wrote:
>
>> Hi Klaus,
>>
>> With default quorum options I've performed the following on my 3 node
>> cluster:
>>
>> Bring down cluster services on one node - the running services migrate to
>> another node
>> Wait 3 minutes
>> Bring down cluster services on one of the two remaining nodes - the
>> surviving node in the cluster is then fenced
>>
>> Instead of the surviving node being fenced, I hoped that the services
>> would migrate and run on that remaining node.
>>
>> Just looking for confirmation that my understanding is ok and if I'm
>> missing something?
>
> As said, I've never used it ...
> Well, when down to 2 nodes LMS is by definition getting into trouble, as
> after another outage either of them is gonna be alone. In case of an
> ordered shutdown this could possibly be circumvented though. So I guess
> your first attempt to enable auto-tie-breaker was the right idea. Like this
> you will have further service on at least one of the nodes.
> So I guess what you were seeing is the right - and unfortunately only
> possible - behavior.
> Where LMS shines is probably scenarios with substantially more nodes.

Or go for qdevice with LMS, where I would expect it to be able to really go
down to a single node left - any of the 2 last ones - as there is still
qdevice. Sorry for the confusion btw.

Klaus
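To make that qdevice suggestion a bit more concrete, a minimal sketch of what
the corosync.conf quorum section might look like with a quorum device using
the LMS algorithm is shown below. It assumes a separate host running
corosync-qnetd (qnetd-host is a placeholder name) and the corosync-qdevice
package installed on the three cluster nodes; treat it as an untested
illustration rather than a verified configuration:

    quorum {
        provider: corosync_votequorum
        device {
            model: net
            net {
                host: qnetd-host    # placeholder: a separate host running corosync-qnetd
                algorithm: lms      # the LMS algorithm referred to above
            }
        }
    }

With pcs this would typically be added with something like
pcs quorum device add model net host=qnetd-host algorithm=lms, though the
exact options available depend on the pcs version in use.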
>
> Klaus
>
>> Thanks
>> David
>>
>> On Thu, 31 Aug 2023 at 11:59, David Dolan <daithido...@gmail.com> wrote:
>>
>>> I just tried removing all the quorum options, setting back to defaults,
>>> so no last_man_standing or wait_for_all.
>>> I still see the same behaviour where the third node is fenced if I bring
>>> down services on two nodes.
>>> Thanks
>>> David
>>>
>>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger <kwenn...@redhat.com> wrote:
>>>
>>>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan <daithido...@gmail.com> wrote:
>>>>
>>>>> On Wed, 30 Aug 2023 at 17:35, David Dolan <daithido...@gmail.com> wrote:
>>>>>
>>>>>>> > Hi All,
>>>>>>> >
>>>>>>> > I'm running Pacemaker on Centos7
>>>>>>> > Name        : pcs
>>>>>>> > Version     : 0.9.169
>>>>>>> > Release     : 3.el7.centos.3
>>>>>>> > Architecture: x86_64
>>>>>>> >
>>>>>>> Besides the pcs version, the versions of the other cluster-stack
>>>>>>> components could be interesting (pacemaker, corosync).
>>>>>>>
>>>>>> rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>>>>>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>>>>>> corosynclib-2.4.5-7.el7_9.2.x86_64
>>>>>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>>>>>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>>>>>> corosync-2.4.5-7.el7_9.2.x86_64
>>>>>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>>>>>> pacemaker-1.1.23-1.el7_9.1.x86_64
>>>>>> pcs-0.9.169-3.el7.centos.3.x86_64
>>>>>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>>>>>
>>>>>>> > I'm performing some cluster failover tests in a 3 node cluster. We
>>>>>>> > have 3 resources in the cluster.
>>>>>>> > I was trying to see if I could get it working if 2 nodes fail at
>>>>>>> > different times. I'd like the 3 resources to then run on one node.
>>>>>>> >
>>>>>>> > The quorum options I've configured are as follows
>>>>>>> > [root@node1 ~]# pcs quorum config
>>>>>>> > Options:
>>>>>>> >   auto_tie_breaker: 1
>>>>>>> >   last_man_standing: 1
>>>>>>> >   last_man_standing_window: 10000
>>>>>>> >   wait_for_all: 1
>>>>>>> >
>>>>>>> Not sure if the combination of auto_tie_breaker and last_man_standing
>>>>>>> makes sense.
>>>>>>> And as you have a cluster with an odd number of nodes, auto_tie_breaker
>>>>>>> should be disabled anyway I guess.
>>>>>>>
>>>>>> Ah ok, I'll try removing auto_tie_breaker and leave last_man_standing.
>>>>>>
>>>>>>> > [root@node1 ~]# pcs quorum status
>>>>>>> > Quorum information
>>>>>>> > ------------------
>>>>>>> > Date:             Wed Aug 30 11:20:04 2023
>>>>>>> > Quorum provider:  corosync_votequorum
>>>>>>> > Nodes:            3
>>>>>>> > Node ID:          1
>>>>>>> > Ring ID:          1/1538
>>>>>>> > Quorate:          Yes
>>>>>>> >
>>>>>>> > Votequorum information
>>>>>>> > ----------------------
>>>>>>> > Expected votes:   3
>>>>>>> > Highest expected: 3
>>>>>>> > Total votes:      3
>>>>>>> > Quorum:           2
>>>>>>> > Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker
>>>>>>> >
>>>>>>> > Membership information
>>>>>>> > ----------------------
>>>>>>> >     Nodeid      Votes    Qdevice Name
>>>>>>> >          1          1         NR node1 (local)
>>>>>>> >          2          1         NR node2
>>>>>>> >          3          1         NR node3
>>>>>>> >
>>>>>>> > If I stop the cluster services on node 2 and 3, the groups all
>>>>>>> > failover to node 1 since it is the node with the lowest ID.
>>>>>>> > But if I stop them on node1 and node2 or node1 and node3, the
>>>>>>> > cluster fails.
>>>>>>> >
>>>>>>> > I tried adding this line to corosync.conf and I could then bring
>>>>>>> > down the services on node 1 and 2 or node 2 and 3, but if I left
>>>>>>> > node 2 until last, the cluster failed
>>>>>>> > auto_tie_breaker_node: 1 3
>>>>>>> >
>>>>>>> > This line had the same outcome as using 1 3
>>>>>>> > auto_tie_breaker_node: 1 2 3
>>>>>>> >
>>>>>>> Giving multiple auto_tie_breaker nodes doesn't make sense to me, but
>>>>>>> rather sounds dangerous if that configuration is possible at all.
>>>>>>>
>>>>>>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>>>>>>> recognized) misconfiguration.
>>>>>>> Did you wait long enough between letting the 2 nodes fail?
>>>>>>>
>>>>>> I've done it so many times that I believe so. But I'll try removing the
>>>>>> auto_tie_breaker config, leaving last_man_standing. I'll also make sure
>>>>>> I leave a couple of minutes between bringing down the nodes and post
>>>>>> back.
>>>>>>
>>>>> Just confirming I removed the auto_tie_breaker config and tested.
>>>>> Quorum configuration is as follows:
>>>>> Options:
>>>>>   last_man_standing: 1
>>>>>   last_man_standing_window: 10000
>>>>>   wait_for_all: 1
>>>>>
>>>>> I waited 2-3 minutes between stopping cluster services on two nodes
>>>>> via pcs cluster stop.
>>>>> The remaining cluster node is then fenced. I was hoping the remaining
>>>>> node would stay online running the resources.
>>>>
>>>> Yep - that would've been my understanding as well.
>>>> But honestly I've never used last_man_standing in this context - wasn't
>>>> even aware that it was offered without qdevice, nor have I checked how
>>>> it is implemented.
>>>>
>>>> Klaus
>>>>
>>>>>>> Klaus
>>>>>>>
>>>>>>> > So I'd like it to failover when any combination of two nodes fail,
>>>>>>> > but I've only had success when the middle node isn't last.
>>>>>>> >
>>>>>>> > Thanks
>>>>>>> > David
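For reference, the last_man_standing plus auto_tie_breaker combination
discussed above (the "right idea" for keeping service on at least one node)
corresponds roughly to a corosync.conf quorum section like the sketch below.
The comments reflect the behaviour described in this thread and the
votequorum defaults as far as I know them, not a tested recommendation:

    quorum {
        provider: corosync_votequorum
        wait_for_all: 1
        last_man_standing: 1
        last_man_standing_window: 10000   # milliseconds; node failures need to be
                                          # farther apart than this window
        auto_tie_breaker: 1               # tie-breaker defaults to the lowest node ID
                                          # (node1 here), which matches only node1 being
                                          # able to survive as the last node standing
    }

With the pcs version in this thread the same options can usually be changed
with pcs quorum update (for example
pcs quorum update auto_tie_breaker=1 last_man_standing=1 last_man_standing_window=10000 wait_for_all=1),
typically with the cluster stopped; check the pcs documentation for the
installed version.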