Re: [ClusterLabs] issue during Pacemaker failover testing
On Mon, Sep 4, 2023 at 4:44 PM David Dolan wrote:
> Thanks Klaus\Andrei,
>
> So if I understand correctly, what I'm trying probably shouldn't work.

It is impossible to configure corosync (or any other cluster system, for that matter) to keep the *arbitrary* last node quorate. It is possible to designate one node as "preferred" and to keep it quorate. Returning to your example:

> I tried adding this line to corosync.conf and I could then bring down the
> services on node 1 and 2 or node 2 and 3, but if I left node 2 until last, the
> cluster failed
> auto_tie_breaker_node: 1 3

Correct. In your scenario the tiebreaker is only relevant with two nodes. When the first node is down, the remaining two nodes select the tiebreaker; it can only be node 1 or 3.

> This line had the same outcome as using 1 3
> auto_tie_breaker_node: 1 2 3

If it really has the same outcome (i.e. the cluster fails when node 2 is left), it is a bug. This line makes node 1 or 2 a possible tiebreaker, so the cluster must fail if node 3 is left, not node 2.

What most certainly *is* possible: no-quorum-policy=ignore + reliable fencing. This worked just fine in two-node clusters without two_node. It does not make the last node quorate, but it allows pacemaker to continue providing services on this node *and* to take over services from other nodes if they were fenced successfully.

> And I should attempt setting auto_tie_breaker in corosync and remove
> last_man_standing.
> Then, I should set up another server with qdevice and configure that using
> the LMS algorithm.
> Thanks
> David

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
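For illustration, the "designated tiebreaker" setup Andrei describes corresponds to a corosync.conf quorum section along these lines (node IDs taken from the example above; a sketch, not a tested configuration):

```
quorum {
    provider: corosync_votequorum
    # Keep one designated node (not an arbitrary one) quorate
    # when the cluster splits evenly:
    auto_tie_breaker: 1
    # A single node ID; "lowest" (the default) picks the lowest
    # node ID in the surviving partition.
    auto_tie_breaker_node: 1
}
```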
Re: [ClusterLabs] issue during Pacemaker failover testing
Thanks Klaus\Andrei,

So if I understand correctly, what I'm trying probably shouldn't work. And I should attempt setting auto_tie_breaker in corosync and remove last_man_standing. Then, I should set up another server with qdevice and configure that using the LMS algorithm.

Thanks
David
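The qdevice approach David plans to try would, in outline, add a device section to the corosync quorum block and select the qdevice's own LMS algorithm (the qnetd hostname below is a placeholder; treat this as a sketch):

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        # one extra vote contributed by the quorum device
        votes: 1
        net {
            # placeholder hostname of the corosync-qnetd server
            host: qnetd.example.com
            # the qdevice's own last-man-standing algorithm --
            # not the corosync last_man_standing flag
            algorithm: lms
        }
    }
}
```

Note that this must not be combined with `last_man_standing: 1` in corosync.conf; as quoted from the documentation elsewhere in this thread, that flag disables the quorum device.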
Re: [ClusterLabs] issue during Pacemaker failover testing
On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov wrote:
> I still do not see where fencing comes from. Pacemaker requests
> fencing of the missing nodes. It also may request self-fencing, but
> not in the default settings. It is rather hard to tell what happens
> without logs from the last remaining node.
>
> That said, the default action is to stop all resources, so the end
> result is not very different :)

But you are of course right. The expected behaviour would be that the leftover node stops the resources. But maybe we're missing something here. Hard to tell without the exact configuration, including fencing. Again, as already said, I don't know anything about the LMS implementation with corosync. In theory there were arguments both for suicide (but that would have to be done by pacemaker) and for automatically switching to some 2-node mode once the remaining partition is reduced to just 2, followed by a fence race (when done without the precautions otherwise used for 2-node clusters). But I guess in this case it is neither of those 2.

Klaus
Re: [ClusterLabs] issue during Pacemaker failover testing
On Mon, Sep 4, 2023 at 1:44 PM Andrei Borzenkov wrote:
> On Mon, Sep 4, 2023 at 2:25 PM Klaus Wenninger wrote:
> >
> > Or go for qdevice with LMS, where I would expect it to be able to really
> > go down to a single node left - any of the 2 last ones - as there is
> > still qdevice.
> > Sorry for the confusion btw.
>
> According to the documentation, "LMS is also incompatible with quorum
> devices, if last_man_standing is specified in corosync.conf then the
> quorum device will be disabled".

That is why I said qdevice with LMS - but it was probably not explicit enough without saying that I meant the qdevice algorithm and not the corosync flag.

Klaus
Re: [ClusterLabs] issue during Pacemaker failover testing
On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger wrote:
> As said I've never used it ...
> Well, when down to 2 nodes LMS per definition is getting into trouble, as
> after another outage any of them is gonna be alone. In case of an ordered
> shutdown this could possibly be circumvented though. So I guess your first
> attempt to enable auto-tie-breaker was the right idea. Like this you will
> have further service at least on one of the nodes.
> So I guess what you were seeing is the right - and unfortunately the only
> possible - behavior.

I still do not see where fencing comes from. Pacemaker requests fencing of the missing nodes. It also may request self-fencing, but not in the default settings. It is rather hard to tell what happens without logs from the last remaining node.

That said, the default action is to stop all resources, so the end result is not very different :)
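The no-quorum-policy behaviour Andrei refers to can be inspected and, given reliable fencing, changed with pcs. Roughly (syntax as in the pcs 0.9.x generation discussed in this thread; verify against your installed version):

```
# Show the current value; the default is "stop"
pcs property show no-quorum-policy

# Only safe with working fencing configured -- lets a node that
# has lost quorum keep running (and take over) resources:
pcs property set no-quorum-policy=ignore
```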
Re: [ClusterLabs] issue during Pacemaker failover testing
On Mon, Sep 4, 2023 at 2:25 PM Klaus Wenninger wrote:
>
> Or go for qdevice with LMS, where I would expect it to be able to really
> go down to a single node left - any of the 2 last ones - as there is
> still qdevice.
> Sorry for the confusion btw.

According to the documentation, "LMS is also incompatible with quorum devices, if last_man_standing is specified in corosync.conf then the quorum device will be disabled".
Re: [ClusterLabs] issue during Pacemaker failover testing
On Mon, Sep 4, 2023 at 1:18 PM Klaus Wenninger wrote:
> As said I've never used it ...
> Well, when down to 2 nodes LMS per definition is getting into trouble, as
> after another outage any of them is gonna be alone. In case of an ordered
> shutdown this could possibly be circumvented though. So I guess your first
> attempt to enable auto-tie-breaker was the right idea. Like this you will
> have further service at least on one of the nodes.
> So I guess what you were seeing is the right - and unfortunately the only
> possible - behavior.
> Where LMS shines is probably scenarios with substantially more nodes.

Or go for qdevice with LMS, where I would expect it to be able to really go down to a single node left - any of the 2 last ones - as there is still qdevice.
Sorry for the confusion btw.

Klaus
Re: [ClusterLabs] issue during Pacemaker failover testing
On Mon, Sep 4, 2023 at 12:45 PM David Dolan wrote:
> Hi Klaus,
>
> With default quorum options I've performed the following on my 3 node
> cluster
>
> Bring down cluster services on one node - the running services migrate
> to another node
> Wait 3 minutes
> Bring down cluster services on one of the two remaining nodes - the
> surviving node in the cluster is then fenced
>
> Instead of the surviving node being fenced, I hoped that the services
> would migrate and run on that remaining node.
>
> Just looking for confirmation that my understanding is ok and if I'm
> missing something?

As said I've never used it ...
Well, when down to 2 nodes LMS per definition is getting into trouble, as after another outage any of them is gonna be alone. In case of an ordered shutdown this could possibly be circumvented though. So I guess your first attempt to enable auto-tie-breaker was the right idea. Like this you will have further service at least on one of the nodes. So I guess what you were seeing is the right - and unfortunately the only possible - behavior.
Where LMS shines is probably scenarios with substantially more nodes.

Klaus
Re: [ClusterLabs] issue during Pacemaker failover testing
On Mon, Sep 4, 2023 at 1:45 PM David Dolan wrote:
>
> Bring down cluster services on one of the two remaining nodes - the
> surviving node in the cluster is then fenced

Is it fenced or is it reset? It is not the same. The default for no-quorum-policy is "stop". So you either have no-quorum-policy set to "suicide", or the node is reset by something outside of pacemaker. This "something" may initiate fencing too.

> Instead of the surviving node being fenced, I hoped that the services
> would migrate and run on that remaining node.
>
> Just looking for confirmation that my understanding is ok and if I'm
> missing something?
>
> Thanks
> David
Re: [ClusterLabs] issue during Pacemaker failover testing
Hi Klaus,

With default quorum options I've performed the following on my 3 node cluster:

Bring down cluster services on one node - the running services migrate to another node
Wait 3 minutes
Bring down cluster services on one of the two remaining nodes - the surviving node in the cluster is then fenced

Instead of the surviving node being fenced, I hoped that the services would migrate and run on that remaining node.

Just looking for confirmation that my understanding is ok and if I'm missing something?

Thanks
David
Re: [ClusterLabs] issue during Pacemaker failover testing
I just tried removing all the quorum options, setting everything back to defaults, so no last_man_standing or wait_for_all. I still see the same behaviour: the third node is fenced if I bring down services on two nodes.

Thanks
David
Re: [ClusterLabs] issue during Pacemaker failover testing
On Thu, Aug 31, 2023 at 12:28 PM David Dolan wrote:
> Just confirming I removed the auto_tie_breaker config and tested. Quorum
> configuration is as follows:
> Options:
>   last_man_standing: 1
>   last_man_standing_window: 1
>   wait_for_all: 1
>
> I waited 2-3 minutes between stopping cluster services on two nodes via
> pcs cluster stop. The remaining cluster node is then fenced. I was hoping
> the remaining node would stay online running the resources.

Yep - that would've been my understanding as well. But honestly, I've never used last_man_standing in this context - I wasn't even aware that it was offered without qdevice, nor have I checked how it is implemented.

Klaus

___
Manage your subscription: https://lists.clusterlabs.org/mailman/listinfo/users
ClusterLabs home: https://www.clusterlabs.org/
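The qdevice variant of last_man_standing mentioned in the thread would live in the corosync.conf quorum section. A minimal sketch of what that might look like with a qnetd server and the LMS algorithm - the hostname is a placeholder, not taken from this thread:

```
quorum {
    provider: corosync_votequorum
    device {
        model: net
        net {
            # address of the external corosync-qnetd server (assumed name)
            host: qnetd.example.com
            # per corosync-qdevice(8): ffsplit or lms
            algorithm: lms
        }
    }
}
```

With lms, the qdevice sides with the partition that can still see the qnetd server, which is what allows a single surviving node to stay quorate.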
Re: [ClusterLabs] issue during Pacemaker failover testing
On Wed, 30 Aug 2023 at 17:35, David Dolan wrote:
> I've done it so many times, so I believe so. But I'll try removing the
> auto_tie_breaker config, leaving last_man_standing. I'll also make sure
> I leave a couple of minutes between bringing down the nodes and post back.

Just confirming I removed the auto_tie_breaker config and tested. Quorum configuration is as follows:
Options:
  last_man_standing: 1
  last_man_standing_window: 1
  wait_for_all: 1

I waited 2-3 minutes between stopping cluster services on two nodes via pcs cluster stop. The remaining cluster node is then fenced. I was hoping the remaining node would stay online running the resources.
Re: [ClusterLabs] issue during Pacemaker failover testing
On 30.08.2023 19:23, David Dolan wrote:
>> Use fencing. Quorum is not a replacement for fencing. With (reliable)
>> fencing you can simply run pacemaker with no-quorum-policy=ignore.
>>
>> The practical problem is that usually the last resort that will work
>> in all cases is SBD + suicide, and SBD cannot work without quorum.
>
> Ah, I forgot to mention I do have fencing set up, which connects to VMware
> vCenter. Do you think it's safe to set no-quorum-policy=ignore?

Fencing is always safe. Fencing guarantees that when nodes take over the resources of a missing node, the missing node is actually not running any of those resources. Yes, if fencing fails the resources won't be taken over, but usually that is better than possible corruption.

Quorum is entirely orthogonal to that. If your two nodes lost connection to the third node, they will happily take over its resources whether the third node has already stopped them or not.

If you actually mean "is it guaranteed that the surviving node will always be able to take over resources from the other nodes" - no, it depends on network connectivity. If the connection to VC is lost (or if anything bad happens during communication with VC, like somebody changing the password you use), fencing will fail and resources won't be taken over.
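Given that the whole scheme hinges on the fence agent being able to reach vCenter, it is worth verifying that path by hand before relying on no-quorum-policy=ignore. A hypothetical sketch with fence_vmware_rest - the address, credentials, and node name are placeholders, not values from this thread:

```
# Ask the fence agent for the power status of a node (placeholders throughout)
fence_vmware_rest --ip vcenter.example.com \
    --username clusteruser --password '...' \
    --ssl --ssl-insecure \
    -o status -n node2

# Or exercise the configured stonith resource end to end:
pcs stonith fence node2
```

If the status query fails (unreachable VC, changed password, expired certificate), fencing will fail at the worst possible moment, exactly as described above.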
Re: [ClusterLabs] issue during Pacemaker failover testing
> Besides the pcs version, the versions of the other cluster-stack components
> (pacemaker, corosync) could be interesting.

rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
corosynclib-2.4.5-7.el7_9.2.x86_64
pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
fence-agents-common-4.2.1-41.el7_9.6.x86_64
corosync-2.4.5-7.el7_9.2.x86_64
pacemaker-cli-1.1.23-1.el7_9.1.x86_64
pacemaker-1.1.23-1.el7_9.1.x86_64
pcs-0.9.169-3.el7.centos.3.x86_64
pacemaker-libs-1.1.23-1.el7_9.1.x86_64

> Not sure if the combination of auto_tie_breaker and last_man_standing makes
> sense. And as you have a cluster with an odd number of nodes,
> auto_tie_breaker should be disabled anyway, I guess.

Ah ok, I'll try removing auto_tie_breaker and leave last_man_standing.

> Giving multiple auto_tie_breaker nodes doesn't make sense to me but rather
> sounds dangerous, if that configuration is possible at all.
> Maybe the misbehavior of last_man_standing is due to this (maybe not
> recognized) misconfiguration.
> Did you wait long enough between letting the 2 nodes fail?

I've done it so many times, so I believe so. But I'll try removing the auto_tie_breaker config, leaving last_man_standing. I'll also make sure I leave a couple of minutes between bringing down the nodes and post back.
Re: [ClusterLabs] issue during Pacemaker failover testing
> Use fencing. Quorum is not a replacement for fencing. With (reliable)
> fencing you can simply run pacemaker with no-quorum-policy=ignore.
>
> The practical problem is that usually the last resort that will work
> in all cases is SBD + suicide, and SBD cannot work without quorum.

Ah, I forgot to mention I do have fencing set up, which connects to VMware vCenter. Do you think it's safe to set no-quorum-policy=ignore?

Thanks
David
Re: [ClusterLabs] issue during Pacemaker failover testing
On Wed, Aug 30, 2023 at 2:34 PM David Dolan wrote:
> Hi All,
>
> I'm running Pacemaker on Centos7
> Name: pcs
> Version: 0.9.169
> Release: 3.el7.centos.3
> Architecture: x86_64

Besides the pcs version, the versions of the other cluster-stack components (pacemaker, corosync) could be interesting.

> The quorum options I've configured are as follows
> [root@node1 ~]# pcs quorum config
> Options:
>   auto_tie_breaker: 1
>   last_man_standing: 1
>   last_man_standing_window: 1
>   wait_for_all: 1

Not sure if the combination of auto_tie_breaker and last_man_standing makes sense. And as you have a cluster with an odd number of nodes, auto_tie_breaker should be disabled anyway, I guess.

> I tried adding this line to corosync.conf and I could then bring down the
> services on node 1 and 2 or node 2 and 3, but if I left node 2 until last,
> the cluster failed
> auto_tie_breaker_node: 1 3
>
> This line had the same outcome as using 1 3
> auto_tie_breaker_node: 1 2 3

Giving multiple auto_tie_breaker nodes doesn't make sense to me but rather sounds dangerous, if that configuration is possible at all.

Maybe the misbehavior of last_man_standing is due to this (maybe not recognized) misconfiguration. Did you wait long enough between letting the 2 nodes fail?

Klaus

> So I'd like it to failover when any combination of two nodes fail, but I've
> only had success when the middle node isn't last.
>
> Thanks
> David
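If auto_tie_breaker is used at all, the usual shape of that corosync.conf section is a single deterministic tiebreaker rather than trying to enumerate every node. A sketch (using node id 1 purely as an example):

```
quorum {
    provider: corosync_votequorum
    auto_tie_breaker: 1
    # per votequorum(5) this may be "lowest", "highest", or an ordered
    # list of node ids; a single id keeps the tiebreaker deterministic
    auto_tie_breaker_node: 1
}
```

Note that whichever node(s) are listed, ATB only decides 50/50 splits; it cannot keep an arbitrary last node of three quorate.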
Re: [ClusterLabs] issue during Pacemaker failover testing
On Wed, Aug 30, 2023 at 3:34 PM David Dolan wrote:
> So I'd like it to failover when any combination of two nodes fail, but I've
> only had success when the middle node isn't last.

Use fencing. Quorum is not a replacement for fencing. With (reliable) fencing you can simply run pacemaker with no-quorum-policy=ignore.

The practical problem is that usually the last resort that will work in all cases is SBD + suicide, and SBD cannot work without quorum.
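The no-quorum-policy suggested here is a single Pacemaker cluster property; with pcs it would be set roughly like this (only safe with reliable fencing, as stressed above):

```
pcs property set no-quorum-policy=ignore

# verify the setting took effect
pcs property show no-quorum-policy
```

With this set, a node that loses quorum keeps running its resources instead of stopping them, and relies entirely on fencing to prevent split-brain.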
[ClusterLabs] issue during Pacemaker failover testing
Hi All,

I'm running Pacemaker on Centos7
Name: pcs
Version: 0.9.169
Release: 3.el7.centos.3
Architecture: x86_64

I'm performing some cluster failover tests in a 3 node cluster. We have 3 resources in the cluster. I was trying to see if I could get it working if 2 nodes fail at different times. I'd like the 3 resources to then run on one node.

The quorum options I've configured are as follows

[root@node1 ~]# pcs quorum config
Options:
  auto_tie_breaker: 1
  last_man_standing: 1
  last_man_standing_window: 1
  wait_for_all: 1

[root@node1 ~]# pcs quorum status
Quorum information
------------------
Date:             Wed Aug 30 11:20:04 2023
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          1
Ring ID:          1/1538
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2
Flags:            Quorate WaitForAll LastManStanding AutoTieBreaker

Membership information
----------------------
    Nodeid    Votes    Qdevice    Name
         1        1         NR    node1 (local)
         2        1         NR    node2
         3        1         NR    node3

If I stop the cluster services on node 2 and 3, the groups all failover to node 1 since it is the node with the lowest ID. But if I stop them on node1 and node 2 or node1 and node3, the cluster fails.

I tried adding this line to corosync.conf and I could then bring down the services on node 1 and 2 or node 2 and 3, but if I left node 2 until last, the cluster failed
auto_tie_breaker_node: 1 3

This line had the same outcome as using 1 3
auto_tie_breaker_node: 1 2 3

So I'd like it to failover when any combination of two nodes fail, but I've only had success when the middle node isn't last.

Thanks
David
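The "Quorum: 2" in the output above is where the behaviour comes from: corosync votequorum requires a strict majority of expected votes. A simplified sketch of that arithmetic (one vote per node, ignoring the window timing that last_man_standing uses when recalculating expected votes):

```python
# Simplified model of corosync votequorum arithmetic (one vote per node).
def quorum_threshold(expected_votes: int) -> int:
    """Majority of expected votes: floor(n/2) + 1."""
    return expected_votes // 2 + 1

def is_quorate(live_votes: int, expected_votes: int) -> bool:
    """A partition is quorate when its live votes reach the threshold."""
    return live_votes >= quorum_threshold(expected_votes)

# 3-node cluster with static expected_votes=3: quorum is 2 votes.
assert quorum_threshold(3) == 2
assert is_quorate(2, 3)       # two nodes up -> quorate
assert not is_quorate(1, 3)   # one node left -> inquorate

# last_man_standing works by recalculating expected_votes downward
# after each clean loss (modelled naively here): 3 -> 2 -> 1,
# which is why a lone final node can stay quorate when LMS works.
assert is_quorate(1, 1)
```

This also shows why the recalculation (and hence the wait between failures, and qdevice in the LMS case) matters: with expected_votes stuck at 3, a single survivor can never reach the threshold of 2.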