Re: [ClusterLabs] issue during Pacemaker failover testing
I just tried removing all the quorum options, setting them back to defaults, so there is no last_man_standing or wait_for_all. I still see the same behaviour where the third node is fenced if I bring down services on two nodes.

Thanks
David

On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger wrote:
>
> On Thu, Aug 31, 2023 at 12:28 PM David Dolan wrote:
>>
>> On Wed, 30 Aug 2023 at 17:35, David Dolan wrote:
>>>
>>>>> Hi All,
>>>>>
>>>>> I'm running Pacemaker on CentOS 7
>>>>> Name        : pcs
>>>>> Version     : 0.9.169
>>>>> Release     : 3.el7.centos.3
>>>>> Architecture: x86_64
>>>>>
>>>> Besides the pcs version, the versions of the other cluster-stack
>>>> components (pacemaker, corosync) could be interesting.
>>>>
>>> rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>>> corosynclib-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>>> corosync-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>>> pacemaker-1.1.23-1.el7_9.1.x86_64
>>> pcs-0.9.169-3.el7.centos.3.x86_64
>>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>>
>>>>> I'm performing some cluster failover tests in a 3 node cluster. We have
>>>>> 3 resources in the cluster.
>>>>> I was trying to see if I could get it working if 2 nodes fail at
>>>>> different times. I'd like the 3 resources to then run on one node.
>>>>>
>>>>> The quorum options I've configured are as follows
>>>>> [root@node1 ~]# pcs quorum config
>>>>> Options:
>>>>>   auto_tie_breaker: 1
>>>>>   last_man_standing: 1
>>>>>   last_man_standing_window: 1
>>>>>   wait_for_all: 1
>>>>>
>>>> Not sure if the combination of auto_tie_breaker and last_man_standing
>>>> makes sense. And as you have a cluster with an odd number of nodes,
>>>> auto_tie_breaker should be disabled anyway I guess.
>>>>
>>> Ah ok, I'll try removing auto_tie_breaker and leave last_man_standing.
>>>
>>>>> [root@node1 ~]# pcs quorum status
>>>>> Quorum information
>>>>> ------------------
>>>>> Date:             Wed Aug 30 11:20:04 2023
>>>>> Quorum provider:  corosync_votequorum
>>>>> Nodes:            3
>>>>> Node ID:          1
>>>>> Ring ID:          1/1538
>>>>> Quorate:          Yes
>>>>>
>>>>> Votequorum information
>>>>> ----------------------
>>>>> Expected votes:   3
>>>>> Highest expected: 3
>>>>> Total votes:      3
>>>>> Quorum:           2
>>>>> Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker
>>>>>
>>>>> Membership information
>>>>> ----------------------
>>>>>     Nodeid      Votes    Qdevice Name
>>>>>          1          1         NR node1 (local)
>>>>>          2          1         NR node2
>>>>>          3          1         NR node3
>>>>>
>>>>> If I stop the cluster services on node 2 and 3, the groups all fail over
>>>>> to node 1 since it is the node with the lowest ID.
>>>>> But if I stop them on node1 and node2 or node1 and node3, the cluster
>>>>> fails.
>>>>>
>>>>> I tried adding this line to corosync.conf and I could then bring down
>>>>> the services on node 1 and 2 or node 2 and 3, but if I left node 2 until
>>>>> last, the cluster failed:
>>>>> auto_tie_breaker_node: 1 3
>>>>>
>>>>> This line had the same outcome as using 1 3:
>>>>> auto_tie_breaker_node: 1 2 3
>>>>>
>>>> Giving multiple auto_tie_breaker nodes doesn't make sense to me but
>>>> rather sounds dangerous if that configuration is possible at all.
>>>>
>>>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>>>> recognized) misconfiguration.
>>>> Did you wait long enough between letting the 2 nodes fail?
>>>>
>>> I've done it so many times, so I believe so. But I'll try removing the
>>> auto_tie_breaker config, leaving last_man_standing. I'll also make sure I
>>> leave a couple of minutes between bringing down the nodes and post back.
>>>
>> Just confirming I removed the auto_tie_breaker config and tested. Quorum
>> configuration is as follows:
>> Options:
>>   last_man_standing: 1
>>   last_man_standing_window: 1
>>   wait_for_all: 1
>>
>> I waited 2-3 minutes between stopping cluster services on two nodes via
>> pcs cluster stop.
>> The remaining cluster node is then fenced. I was hoping the remaining node
>> would stay online running the resources.
>>
> Yep - that would've been my understanding as well.
> But honestly I've never used last_man_standing in this context - wasn't
> even aware that it was offered without qdevice nor have I checked how it
> is implemented.
>
> Klaus
>
>>>>> So I'd like it to fail over when any combination of two nodes fails, but
>>>>> I've only had success when the middle node isn't last.
>>>>>
>>>>> Thanks
>>>>> David
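For reference, a minimal sketch of how the quorum settings discussed in this thread can be written down and verified on a CentOS 7 cluster with pcs 0.9.x. The values are illustrative (in particular the 20000 ms window) and the commands come from the votequorum(5) and pcs documentation rather than from the thread, so check them against the man pages before applying:

    # quorum section of /etc/corosync/corosync.conf for a 3-node cluster
    # using last_man_standing without auto_tie_breaker
    # (last_man_standing_window is in milliseconds; 20000 is an example value)
    quorum {
        provider: corosync_votequorum
        last_man_standing: 1
        last_man_standing_window: 20000
        wait_for_all: 1
    }

    # Roughly the same change via pcs (pcs 0.9.x normally expects the
    # cluster to be stopped on all nodes before quorum options are changed):
    pcs quorum update auto_tie_breaker=0 last_man_standing=1 \
        last_man_standing_window=20000 wait_for_all=1

    # Verify what corosync actually picked up:
    pcs quorum config
    corosync-quorumtool -s

The votequorum(5) NOTES section also appears to require auto_tie_breaker (or a quorum device) for last_man_standing to take a cluster from two nodes down to one, which may be relevant to the fencing of the last node described above.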
Re: [ClusterLabs] issue during Pacemaker failover testing
On Thu, Aug 31, 2023 at 12:28 PM David Dolan wrote:
>
> Just confirming I removed the auto_tie_breaker config and tested. Quorum
> configuration is as follows:
> Options:
>   last_man_standing: 1
>   last_man_standing_window: 1
>   wait_for_all: 1
>
> I waited 2-3 minutes between stopping cluster services on two nodes via
> pcs cluster stop.
> The remaining cluster node is then fenced. I was hoping the remaining node
> would stay online running the resources.
>
Yep - that would've been my understanding as well.
But honestly I've never used last_man_standing in this context - wasn't even
aware that it was offered without qdevice nor have I checked how it is
implemented.

Klaus
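Since last_man_standing is more commonly paired with a quorum device, here is a rough sketch of how a qdevice could be added to a cluster like this with pcs 0.9.x. The host name qnetd-host is a placeholder, and the commands are an illustration based on the pcs and corosync-qdevice documentation rather than anything tested in this thread, so verify them with pcs quorum device add --help first:

    # On a separate host that will act as the quorum arbiter:
    yum install corosync-qnetd pcs
    pcs qdevice setup model net --enable --start

    # On every cluster node:
    yum install corosync-qdevice

    # On one cluster node, register the device
    # (lms = last-man-standing style algorithm):
    pcs quorum device add model net host=qnetd-host algorithm=lms

    # Check that the device is seen and voting:
    pcs quorum device status
    pcs quorum status

With the lms algorithm, a single surviving node that can still reach the arbiter should be able to retain quorum, which is the scenario being tested in this thread.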
Re: [ClusterLabs] issue during Pacemaker failover testing
On Wed, 30 Aug 2023 at 17:35, David Dolan wrote:
>
>> Giving multiple auto_tie_breaker nodes doesn't make sense to me but rather
>> sounds dangerous if that configuration is possible at all.
>>
>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>> recognized) misconfiguration.
>> Did you wait long enough between letting the 2 nodes fail?
>>
> I've done it so many times, so I believe so. But I'll try removing the
> auto_tie_breaker config, leaving last_man_standing. I'll also make sure I
> leave a couple of minutes between bringing down the nodes and post back.

Just confirming I removed the auto_tie_breaker config and tested. Quorum
configuration is as follows:
Options:
  last_man_standing: 1
  last_man_standing_window: 1
  wait_for_all: 1

I waited 2-3 minutes between stopping cluster services on two nodes via
pcs cluster stop.
The remaining cluster node is then fenced. I was hoping the remaining node
would stay online running the resources.
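Finally, a sketch of the failover test sequence described above, with placeholder node names; the comments about expected output are assumptions about how last_man_standing should behave rather than results reported in the thread:

    # On node3: stop the cluster services on this node only, then wait
    # longer than last_man_standing_window so expected votes can be
    # recalculated before the next failure.
    pcs cluster stop

    # ...wait 2-3 minutes...

    # On node2: stop the cluster services here as well.
    pcs cluster stop

    # On node1, the intended survivor:
    corosync-quorumtool -s   # expected votes should have shrunk if LMS applied
    pcs status               # the three resources should be running here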