Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-31 Thread David Dolan
I just tried removing all the quorum options, setting them back to defaults,
so no last_man_standing or wait_for_all.
I still see the same behaviour where the third node is fenced if I bring
down services on two nodes.
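For reference, a rough sketch of what "back to defaults" means here (assumed
rather than copied verbatim from the nodes): the quorum section of
/etc/corosync/corosync.conf reduced to just the votequorum provider, followed
by a full cluster restart, after which pcs quorum config should list no
options:

quorum {
    provider: corosync_votequorum
}

[root@node1 ~]# pcs quorum config
Options: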
Thanks
David

On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger  wrote:

>
>
> On Thu, Aug 31, 2023 at 12:28 PM David Dolan 
> wrote:
>
>>
>>
>> On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:
>>
>>>
>>>
>>> > Hi All,
>>>> >
>>>> > I'm running Pacemaker on Centos7
>>>> > Name: pcs
>>>> > Version : 0.9.169
>>>> > Release : 3.el7.centos.3
>>>> > Architecture: x86_64
>>>> >
>>>> >
>>>> Besides the pcs-version versions of the other cluster-stack-components
>>>> could be interesting. (pacemaker, corosync)
>>>>
>>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>>> corosynclib-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>>> corosync-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>>> pacemaker-1.1.23-1.el7_9.1.x86_64
>>> pcs-0.9.169-3.el7.centos.3.x86_64
>>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>>


>>>> > I'm performing some cluster failover tests in a 3 node cluster. We
>>>> have 3
>>>> > resources in the cluster.
>>>> > I was trying to see if I could get it working if 2 nodes fail at
>>>> different
>>>> > times. I'd like the 3 resources to then run on one node.
>>>> >
>>>> > The quorum options I've configured are as follows
>>>> > [root@node1 ~]# pcs quorum config
>>>> > Options:
>>>> >   auto_tie_breaker: 1
>>>> >   last_man_standing: 1
>>>> >   last_man_standing_window: 1
>>>> >   wait_for_all: 1
>>>> >
>>>> >
>>>> Not sure if the combination of auto_tie_breaker and last_man_standing
>>>> makes
>>>> sense.
>>>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>>>> should be
>>>> disabled anyway I guess.

>>> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>>


>>>> > [root@node1 ~]# pcs quorum status
>>>> > Quorum information
>>>> > --
>>>> > Date: Wed Aug 30 11:20:04 2023
>>>> > Quorum provider:  corosync_votequorum
>>>> > Nodes:3
>>>> > Node ID:  1
>>>> > Ring ID:  1/1538
>>>> > Quorate:  Yes
>>>> >
>>>> > Votequorum information
>>>> > --
>>>> > Expected votes:   3
>>>> > Highest expected: 3
>>>> > Total votes:  3
>>>> > Quorum:   2
>>>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>>>> >
>>>> > Membership information
>>>> > --
>>>> > Nodeid  VotesQdevice Name
>>>> >  1  1 NR node1 (local)
>>>> >  2  1 NR node2
>>>> >  3  1 NR node3
>>>> >
>>>> > If I stop the cluster services on node 2 and 3, the groups all
>>>> failover to
>>>> > node 1 since it is the node with the lowest ID
>>>> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
>>>> > fails.
>>>> >
>>>> > I tried adding this line to corosync.conf and I could then bring down
>>>> the
>>>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
>>>> last,
>>>> > the cluster failed
>>>> > auto_tie_breaker_node: 1  3
>>>> >
>>>> > This line had the same outcome as using 1 3
>>>> > auto_tie_breaker_node: 1  2 3
>>>> >
>>>> >
>>>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but
>>>> rather
>>>> sounds dangerous if that configuration is possible at all.
>>>>
>>>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>>>> recognized) misconfiguration.
>>>> Did you wait long enough between letting the 2 nodes fail?

>>> I've done it so many times so I believe so. But I'll try remove the
>>> auto_tie_breaker config, leaving the last_man_standing. I'll also make sure
>>> I leave a couple of minutes between bringing down the nodes and post back.
>>>
>> Just confirming I removed the auto_tie_breaker config and tested. Quorum
>> configuration is as follows:
>>  Options:
>>   last_man_standing: 1
>>   last_man_standing_window: 1
>>   wait_for_all: 1
>>
>> I waited 2-3 minutes between stopping cluster services on two nodes via
>> pcs cluster stop
>> The remaining cluster node is then fenced. I was hoping the remaining
>> node would stay online running the resources.
>>
>
> Yep - that would've been my understanding as well.
> But honestly I've never used last_man_standing in this context - wasn't
> even aware that it was
> offered without qdevice nor have I checked how it is implemented.
>
> Klaus
>
>>
>>
>>>> Klaus
>>>>
>>>>
>>>> > So I'd like it to failover when any combination of two nodes fail but
>>>> I've
>>>> > only had success when the middle node isn't last.
>>>> >
>>>> > Thanks
>>>> > David





Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-31 Thread Klaus Wenninger
On Thu, Aug 31, 2023 at 12:28 PM David Dolan  wrote:

>
>
> On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:
>
>>
>>
>> > Hi All,
>>> >
>>> > I'm running Pacemaker on Centos7
>>> > Name: pcs
>>> > Version : 0.9.169
>>> > Release : 3.el7.centos.3
>>> > Architecture: x86_64
>>> >
>>> >
>>> Besides the pcs-version versions of the other cluster-stack-components
>>> could be interesting. (pacemaker, corosync)
>>>
>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>> corosynclib-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>> corosync-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>> pacemaker-1.1.23-1.el7_9.1.x86_64
>> pcs-0.9.169-3.el7.centos.3.x86_64
>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>
>>>
>>>
>>> > I'm performing some cluster failover tests in a 3 node cluster. We
>>> have 3
>>> > resources in the cluster.
>>> > I was trying to see if I could get it working if 2 nodes fail at
>>> different
>>> > times. I'd like the 3 resources to then run on one node.
>>> >
>>> > The quorum options I've configured are as follows
>>> > [root@node1 ~]# pcs quorum config
>>> > Options:
>>> >   auto_tie_breaker: 1
>>> >   last_man_standing: 1
>>> >   last_man_standing_window: 1
>>> >   wait_for_all: 1
>>> >
>>> >
>>> Not sure if the combination of auto_tie_breaker and last_man_standing
>>> makes
>>> sense.
>>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>>> should be
>>> disabled anyway I guess.
>>>
>> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>
>>>
>>>
>>> > [root@node1 ~]# pcs quorum status
>>> > Quorum information
>>> > --
>>> > Date: Wed Aug 30 11:20:04 2023
>>> > Quorum provider:  corosync_votequorum
>>> > Nodes:3
>>> > Node ID:  1
>>> > Ring ID:  1/1538
>>> > Quorate:  Yes
>>> >
>>> > Votequorum information
>>> > --
>>> > Expected votes:   3
>>> > Highest expected: 3
>>> > Total votes:  3
>>> > Quorum:   2
>>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>>> >
>>> > Membership information
>>> > --
>>> > Nodeid  VotesQdevice Name
>>> >  1  1 NR node1 (local)
>>> >  2  1 NR node2
>>> >  3  1 NR node3
>>> >
>>> > If I stop the cluster services on node 2 and 3, the groups all
>>> failover to
>>> > node 1 since it is the node with the lowest ID
>>> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
>>> > fails.
>>> >
>>> > I tried adding this line to corosync.conf and I could then bring down
>>> the
>>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
>>> last,
>>> > the cluster failed
>>> > auto_tie_breaker_node: 1  3
>>> >
>>> > This line had the same outcome as using 1 3
>>> > auto_tie_breaker_node: 1  2 3
>>> >
>>> >
>>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but
>>> rather
>>> sounds dangerous if that configuration is possible at all.
>>>
>>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>>> recognized) misconfiguration.
>>> Did you wait long enough between letting the 2 nodes fail?
>>>
>> I've done it so many times so I believe so. But I'll try remove the
>> auto_tie_breaker config, leaving the last_man_standing. I'll also make sure
>> I leave a couple of minutes between bringing down the nodes and post back.
>>
> Just confirming I removed the auto_tie_breaker config and tested. Quorum
> configuration is as follows:
>  Options:
>   last_man_standing: 1
>   last_man_standing_window: 1
>   wait_for_all: 1
>
> I waited 2-3 minutes between stopping cluster services on two nodes via
> pcs cluster stop
> The remaining cluster node is then fenced. I was hoping the remaining node
> would stay online running the resources.
>

Yep - that would've been my understanding as well.
But honestly I've never used last_man_standing in this context - I wasn't
even aware that it was offered without qdevice, nor have I checked how it
is implemented.
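
If you do end up going the qdevice route, the setup would be roughly along
these lines - an untested sketch, with package names and the qnetd host name
(qnetd-host) assumed rather than taken from your environment. The lms
algorithm is what lets a single surviving node keep quorum:

# on a small machine outside the 3-node cluster
yum install pcs corosync-qnetd
pcs qdevice setup model net --enable --start

# on each of the three cluster nodes
yum install corosync-qdevice

# from any one cluster node
pcs quorum device add model net host=qnetd-host algorithm=lms
pcs quorum status    # membership should now include a Qdevice entry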

Klaus

>
>
>>> Klaus
>>>
>>>
>>> > So I'd like it to failover when any combination of two nodes fail but
>>> I've
>>> > only had success when the middle node isn't last.
>>> >
>>> > Thanks
>>> > David
>>>
>>>
>>>
>>>


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-31 Thread David Dolan
On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:

>
>
> > Hi All,
>> >
>> > I'm running Pacemaker on Centos7
>> > Name: pcs
>> > Version : 0.9.169
>> > Release : 3.el7.centos.3
>> > Architecture: x86_64
>> >
>> >
>> Besides the pcs-version versions of the other cluster-stack-components
>> could be interesting. (pacemaker, corosync)
>>
>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
> corosynclib-2.4.5-7.el7_9.2.x86_64
> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
> fence-agents-common-4.2.1-41.el7_9.6.x86_64
> corosync-2.4.5-7.el7_9.2.x86_64
> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
> pacemaker-1.1.23-1.el7_9.1.x86_64
> pcs-0.9.169-3.el7.centos.3.x86_64
> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>
>>
>>
>> > I'm performing some cluster failover tests in a 3 node cluster. We have
>> 3
>> > resources in the cluster.
>> > I was trying to see if I could get it working if 2 nodes fail at
>> different
>> > times. I'd like the 3 resources to then run on one node.
>> >
>> > The quorum options I've configured are as follows
>> > [root@node1 ~]# pcs quorum config
>> > Options:
>> >   auto_tie_breaker: 1
>> >   last_man_standing: 1
>> >   last_man_standing_window: 1
>> >   wait_for_all: 1
>> >
>> >
>> Not sure if the combination of auto_tie_breaker and last_man_standing
>> makes
>> sense.
>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>> should be
>> disabled anyway I guess.
>>
> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>
>>
>>
>> > [root@node1 ~]# pcs quorum status
>> > Quorum information
>> > --
>> > Date: Wed Aug 30 11:20:04 2023
>> > Quorum provider:  corosync_votequorum
>> > Nodes:3
>> > Node ID:  1
>> > Ring ID:  1/1538
>> > Quorate:  Yes
>> >
>> > Votequorum information
>> > --
>> > Expected votes:   3
>> > Highest expected: 3
>> > Total votes:  3
>> > Quorum:   2
>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>> >
>> > Membership information
>> > --
>> > Nodeid  VotesQdevice Name
>> >  1  1 NR node1 (local)
>> >  2  1 NR node2
>> >  3  1 NR node3
>> >
>> > If I stop the cluster services on node 2 and 3, the groups all failover
>> to
>> > node 1 since it is the node with the lowest ID
>> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
>> > fails.
>> >
>> > I tried adding this line to corosync.conf and I could then bring down
>> the
>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
>> last,
>> > the cluster failed
>> > auto_tie_breaker_node: 1  3
>> >
>> > This line had the same outcome as using 1 3
>> > auto_tie_breaker_node: 1  2 3
>> >
>> >
>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but rather
>> sounds dangerous if that configuration is possible at all.
>>
>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>> recognized) misconfiguration.
>> Did you wait long enough between letting the 2 nodes fail?
>>
> I've done it so many times so I believe so. But I'll try remove the
> auto_tie_breaker config, leaving the last_man_standing. I'll also make sure
> I leave a couple of minutes between bringing down the nodes and post back.
>
Just confirming I removed the auto_tie_breaker config and tested. Quorum
configuration is as follows:
 Options:
  last_man_standing: 1
  last_man_standing_window: 1
  wait_for_all: 1

I waited 2-3 minutes between stopping cluster services on two nodes via
pcs cluster stop.
The remaining cluster node is then fenced. I was hoping the remaining node
would stay online running the resources.
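
One way I might narrow this down on the next run - commands approximate, not
something I've captured output from yet - is to check quorum between the two
stops to see whether last_man_standing actually drops the expected votes:

[root@node3 ~]# pcs cluster stop      # first node leaves cleanly
  ... wait well past last_man_standing_window ...
[root@node1 ~]# pcs quorum status     # did Expected votes drop to 2?
[root@node2 ~]# pcs cluster stop      # second node leaves
[root@node1 ~]# pcs quorum status     # still Quorate, or already fenced?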


>> Klaus
>>
>>
>> > So I'd like it to failover when any combination of two nodes fail but
>> I've
>> > only had success when the middle node isn't last.
>> >
>> > Thanks
>> > David
>>
>>
>>
>>
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/