Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Andrei Borzenkov
On Mon, Sep 4, 2023 at 4:44 PM David Dolan  wrote:
>
> Thanks Klaus\Andrei,
>
> So if I understand correctly what I'm trying probably shouldn't work.

It is impossible to configure corosync (or any other cluster system
for that matter) to keep an *arbitrary* last node quorate. It is
possible to designate one node as "preferred" and to keep it quorate.
Returning to your example:

> I tried adding this line to corosync.conf and I could then bring down the 
> services on node 1 and 2 or node 2 and 3 but if I left node 2 until last, the 
> cluster failed
> auto_tie_breaker_node: 1  3
>

Correct. In your scenario the tie breaker is only relevant with two
nodes. When the first node is down, the remaining two nodes select the
tiebreaker. It can only be node 1 or 3.

> This line had the same outcome as using 1 3
> auto_tie_breaker_node: 1  2 3

If it really has the same outcome (i.e. cluster fails when node 2 is
left) it is a bug. This line makes nodes 1 or 2 a possible tiebreaker.
So the cluster must fail if node 3 is left, not node 2.
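
As a sketch, such a preference is expressed in the quorum section of
corosync.conf roughly like this (node IDs purely illustrative; per
votequorum(5) the list is ordered):

quorum {
    provider: corosync_votequorum
    auto_tie_breaker: 1
    # ordered list of candidate tie-breaker nodes; in a 50/50 split the
    # first listed node still present in the partition keeps quorum
    auto_tie_breaker_node: 1 2
}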

What most certainly *is* possible is no-quorum-policy=ignore + reliable
fencing. This worked just fine in two-node clusters without two_node.
It does not make the last node quorate, but it allows pacemaker to
continue providing services on this node *and* taking over services
from other nodes if they were fenced successfully.
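
A minimal sketch of setting that with pcs (assuming fencing is already
configured and verified; these are standard pacemaker cluster properties):

# only safe together with working, reliable fencing
pcs property set stonith-enabled=true
pcs property set no-quorum-policy=ignore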

> And I should attempt setting auto_tie_breaker in corosync and remove 
> last_man_standing.
> Then, I should set up another server with qdevice and configure that using 
> the LMS algorithm.
>
> Thanks
> David
>
> On Mon, 4 Sept 2023 at 13:32, Klaus Wenninger  wrote:
>>
>>
>>
>> On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov  wrote:
>>>
>>> On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger  wrote:
>>> >
>>> >
>>> >
>>> > On Mon, Sep 4, 2023 at 12:45 PM David Dolan  wrote:
>>> >>
>>> >> Hi Klaus,
>>> >>
>>> >> With default quorum options I've performed the following on my 3 node 
>>> >> cluster
>>> >>
>>> >> Bring down cluster services on one node - the running services migrate 
>>> >> to another node
>>> >> Wait 3 minutes
>>> >> Bring down cluster services on one of the two remaining nodes - the 
>>> >> surviving node in the cluster is then fenced
>>> >>
>>> >> Instead of the surviving node being fenced, I hoped that the services 
>>> >> would migrate and run on that remaining node.
>>> >>
>>> >> Just looking for confirmation that my understanding is ok and if I'm 
>>> >> missing something?
>>> >
>>> >
>>> > As said I've never used it ...
>>> > Well when down to 2 nodes LMS per definition is getting into trouble as 
>>> > after another
>>> > outage any of them is gonna be alone. In case of an ordered shutdown this 
>>> > could
>>> > possibly be circumvented though. So I guess your first attempt to enable
>>> > auto-tie-breaker
>>> > was the right idea. Like this you will have further service at least on 
>>> > one of the nodes.
>>> > So I guess what you were seeing is the right - and unfortunately only 
>>> > possible - behavior.
>>>
>>> I still do not see where fencing comes from. Pacemaker requests
>>> fencing of the missing nodes. It also may request self-fencing, but
>>> not in the default settings. It is rather hard to tell what happens
>>> without logs from the last remaining node.
>>>
>>> That said, the default action is to stop all resources, so the end
>>> result is not very different :)
>>
>>
>> But you are of course right. The expected behaviour would be that
>> the leftover node stops the resources.
>> But maybe we're missing something here. Hard to tell without
>> the exact configuration including fencing.
>> Again, as already said, I don't know anything about the LMS
>> implementation with corosync. In theory there were both arguments
>> to either suicide (but that would have to be done by pacemaker) or
>> to automatically switch to some 2-node-mode once the remaining
>> partition is reduced to just 2 followed by a fence-race (when done
>> without the precautions otherwise used for 2-node-clusters).
>> But I guess in this case it is none of those 2.
>>
>> Klaus
>>>


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread David Dolan
Thanks Klaus\Andrei,

So if I understand correctly what I'm trying probably shouldn't work.
And I should attempt setting auto_tie_breaker in corosync and remove
last_man_standing.
Then, I should set up another server with qdevice and configure that using
the LMS algorithm.
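
A rough sketch of that setup (host and package names are placeholders;
the pcs qdevice/quorum device subcommands need a reasonably recent pcs):

# on the extra server
yum install -y corosync-qnetd pcs
pcs qdevice setup model net --enable --start

# on one of the existing cluster nodes
yum install -y corosync-qdevice
pcs quorum device add model net host=qnetd-server.example.com algorithm=lms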

Thanks
David

On Mon, 4 Sept 2023 at 13:32, Klaus Wenninger  wrote:

>
>
> On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov 
> wrote:
>
>> On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger 
>> wrote:
>> >
>> >
>> >
>> > On Mon, Sep 4, 2023 at 12:45 PM David Dolan 
>> wrote:
>> >>
>> >> Hi Klaus,
>> >>
>> >> With default quorum options I've performed the following on my 3 node
>> cluster
>> >>
>> >> Bring down cluster services on one node - the running services migrate
>> to another node
>> >> Wait 3 minutes
>> >> Bring down cluster services on one of the two remaining nodes - the
>> surviving node in the cluster is then fenced
>> >>
>> >> Instead of the surviving node being fenced, I hoped that the services
>> would migrate and run on that remaining node.
>> >>
>> >> Just looking for confirmation that my understanding is ok and if I'm
>> missing something?
>> >
>> >
>> > As said I've never used it ...
>> > Well when down to 2 nodes LMS per definition is getting into trouble as
>> after another
>> > outage any of them is gonna be alone. In case of an ordered shutdown
>> this could
>> > possibly be circumvented though. So I guess your first attempt to enable
>> auto-tie-breaker
>> > was the right idea. Like this you will have further service at least on
>> one of the nodes.
>> > So I guess what you were seeing is the right - and unfortunately only
>> possible - behavior.
>>
>> I still do not see where fencing comes from. Pacemaker requests
>> fencing of the missing nodes. It also may request self-fencing, but
>> not in the default settings. It is rather hard to tell what happens
>> without logs from the last remaining node.
>>
>> That said, the default action is to stop all resources, so the end
>> result is not very different :)
>>
>
> But you are of course right. The expected behaviour would be that
> the leftover node stops the resources.
> But maybe we're missing something here. Hard to tell without
> the exact configuration including fencing.
> Again, as already said, I don't know anything about the LMS
> implementation with corosync. In theory there were both arguments
> to either suicide (but that would have to be done by pacemaker) or
> to automatically switch to some 2-node-mode once the remaining
> partition is reduced to just 2 followed by a fence-race (when done
> without the precautions otherwise used for 2-node-clusters).
> But I guess in this case it is none of those 2.
>
> Klaus
>


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Klaus Wenninger
On Mon, Sep 4, 2023 at 1:50 PM Andrei Borzenkov  wrote:

> On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger 
> wrote:
> >
> >
> >
> > On Mon, Sep 4, 2023 at 12:45 PM David Dolan 
> wrote:
> >>
> >> Hi Klaus,
> >>
> >> With default quorum options I've performed the following on my 3 node
> cluster
> >>
> >> Bring down cluster services on one node - the running services migrate
> to another node
> >> Wait 3 minutes
> >> Bring down cluster services on one of the two remaining nodes - the
> surviving node in the cluster is then fenced
> >>
> >> Instead of the surviving node being fenced, I hoped that the services
> would migrate and run on that remaining node.
> >>
> >> Just looking for confirmation that my understanding is ok and if I'm
> missing something?
> >
> >
> > As said I've never used it ...
> > Well when down to 2 nodes LMS per definition is getting into trouble as
> after another
> > outage any of them is gonna be alone. In case of an ordered shutdown
> this could
> > possibly be circumvented though. So I guess your first attempt to enable
> auto-tie-breaker
> > was the right idea. Like this you will have further service at least on
> one of the nodes.
> > So I guess what you were seeing is the right - and unfortunately only
> possible - behavior.
>
> I still do not see where fencing comes from. Pacemaker requests
> fencing of the missing nodes. It also may request self-fencing, but
> not in the default settings. It is rather hard to tell what happens
> without logs from the last remaining node.
>
> That said, the default action is to stop all resources, so the end
> result is not very different :)
>

But you are of course right. The expected behaviour would be that
the leftover node stops the resources.
But maybe we're missing something here. Hard to tell without
the exact configuration including fencing.
Again, as already said, I don't know anything about the LMS
implementation with corosync. In theory there were both arguments
to either suicide (but that would have to be done by pacemaker) or
to automatically switch to some 2-node-mode once the remaining
partition is reduced to just 2 followed by a fence-race (when done
without the precautions otherwise used for 2-node-clusters).
But I guess in this case it is none of those 2.

Klaus



Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Klaus Wenninger
On Mon, Sep 4, 2023 at 1:44 PM Andrei Borzenkov  wrote:

> On Mon, Sep 4, 2023 at 2:25 PM Klaus Wenninger 
> wrote:
> >
> >
> > Or go for qdevice with LMS where I would expect it to be able to really
> go down to
> > a single node left - any of the 2 last ones - as there is still qdevice.
> > Sry for the confusion btw.
> >
>
> According to documentation, "LMS is also incompatible with quorum
> devices, if last_man_standing is specified in corosync.conf then the
> quorum device will be disabled".
>

That is why I said qdevice with LMS - but it was probably not explicit
enough without telling that I meant the qdevice algorithm and not
the corosync flag.
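
To make the distinction concrete, a corosync.conf sketch (host name is a
placeholder): the corosync last_man_standing flag sits directly under
quorum { }, while the qdevice LMS algorithm is a setting of the quorum
device itself:

quorum {
    provider: corosync_votequorum
    # corosync flag (the one the quoted documentation warns about):
    # last_man_standing: 1
    device {
        model: net
        net {
            host: qnetd-server.example.com   # placeholder
            algorithm: lms                   # qdevice algorithm meant here
        }
    }
}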

Klaus



Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Andrei Borzenkov
On Mon, Sep 4, 2023 at 2:18 PM Klaus Wenninger  wrote:
>
>
>
> On Mon, Sep 4, 2023 at 12:45 PM David Dolan  wrote:
>>
>> Hi Klaus,
>>
>> With default quorum options I've performed the following on my 3 node cluster
>>
>> Bring down cluster services on one node - the running services migrate to 
>> another node
>> Wait 3 minutes
>> Bring down cluster services on one of the two remaining nodes - the 
>> surviving node in the cluster is then fenced
>>
>> Instead of the surviving node being fenced, I hoped that the services would 
>> migrate and run on that remaining node.
>>
>> Just looking for confirmation that my understanding is ok and if I'm missing 
>> something?
>
>
> As said I've never used it ...
> Well when down to 2 nodes LMS per definition is getting into trouble as after 
> another
> outage any of them is gonna be alone. In case of an ordered shutdown this 
> could
> possibly be circumvented though. So I guess your first attempt to enable
> auto-tie-breaker
> was the right idea. Like this you will have further service at least on one 
> of the nodes.
> So I guess what you were seeing is the right - and unfortunately only 
> possible - behavior.

I still do not see where fencing comes from. Pacemaker requests
fencing of the missing nodes. It also may request self-fencing, but
not in the default settings. It is rather hard to tell what happens
without logs from the last remaining node.

That said, the default action is to stop all resources, so the end
result is not very different :)
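
If logs are needed, a sketch of what could be collected from that node
(the time window is a placeholder):

journalctl -u corosync -u pacemaker --since "2023-09-04 12:00" --until "2023-09-04 13:00"
crm_report --from "2023-09-04 12:00:00" --to "2023-09-04 13:00:00" /tmp/failover-report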


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Andrei Borzenkov
On Mon, Sep 4, 2023 at 2:25 PM Klaus Wenninger  wrote:
>
>
> Or go for qdevice with LMS where I would expect it to be able to really go 
> down to
> a single node left - any of the 2 last ones - as there is still qdevice.
> Sry for the confusion btw.
>

According to documentation, "LMS is also incompatible with quorum
devices, if last_man_standing is specified in corosync.conf then the
quorum device will be disabled".


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Klaus Wenninger
On Mon, Sep 4, 2023 at 1:18 PM Klaus Wenninger  wrote:

>
>
> On Mon, Sep 4, 2023 at 12:45 PM David Dolan  wrote:
>
>> Hi Klaus,
>>
>> With default quorum options I've performed the following on my 3 node
>> cluster
>>
>> Bring down cluster services on one node - the running services migrate to
>> another node
>> Wait 3 minutes
>> Bring down cluster services on one of the two remaining nodes - the
>> surviving node in the cluster is then fenced
>>
>> Instead of the surviving node being fenced, I hoped that the services
>> would migrate and run on that remaining node.
>>
>> Just looking for confirmation that my understanding is ok and if I'm
>> missing something?
>>
>
> As said I've never used it ...
> Well when down to 2 nodes LMS per definition is getting into trouble as
> after another
> outage any of them is gonna be alone. In case of an ordered shutdown this
> could
> possibly be circumvented though. So I guess your first attempt to enable
> auto-tie-breaker
> was the right idea. Like this you will have further service at least on
> one of the nodes.
> So I guess what you were seeing is the right - and unfortunately only
> possible - behavior.
> Where LMS shines is probably scenarios with substantially more nodes.
>

Or go for qdevice with LMS where I would expect it to be able to really go
down to
a single node left - any of the 2 last ones - as there is still qdevice.
Sry for the confusion btw.

Klaus

>
> Klaus
>
>>
>> Thanks
>> David
>>
>>
>>
>> On Thu, 31 Aug 2023 at 11:59, David Dolan  wrote:
>>
>>> I just tried removing all the quorum options setting back to defaults so
>>> no last_man_standing or wait_for_all.
>>> I still see the same behaviour where the third node is fenced if I bring
>>> down services on two nodes.
>>> Thanks
>>> David
>>>
>>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger 
>>> wrote:
>>>


 On Thu, Aug 31, 2023 at 12:28 PM David Dolan 
 wrote:

>
>
> On Wed, 30 Aug 2023 at 17:35, David Dolan 
> wrote:
>
>>
>>
>> > Hi All,
>>> >
>>> > I'm running Pacemaker on Centos7
>>> > Name: pcs
>>> > Version : 0.9.169
>>> > Release : 3.el7.centos.3
>>> > Architecture: x86_64
>>> >
>>> >
>>> Besides the pcs-version versions of the other
>>> cluster-stack-components
>>> could be interesting. (pacemaker, corosync)
>>>
>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>> corosynclib-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>> corosync-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>> pacemaker-1.1.23-1.el7_9.1.x86_64
>> pcs-0.9.169-3.el7.centos.3.x86_64
>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>
>>>
>>>
>>> > I'm performing some cluster failover tests in a 3 node cluster. We
>>> have 3
>>> > resources in the cluster.
>>> > I was trying to see if I could get it working if 2 nodes fail at
>>> different
>>> > times. I'd like the 3 resources to then run on one node.
>>> >
>>> > The quorum options I've configured are as follows
>>> > [root@node1 ~]# pcs quorum config
>>> > Options:
>>> >   auto_tie_breaker: 1
>>> >   last_man_standing: 1
>>> >   last_man_standing_window: 1
>>> >   wait_for_all: 1
>>> >
>>> >
>>> Not sure if the combination of auto_tie_breaker and
>>> last_man_standing makes
>>> sense.
>>> And as you have a cluster with an odd number of nodes
>>> auto_tie_breaker
>>> should be
>>> disabled anyway I guess.
>>>
>> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>
>>>
>>>
>>> > [root@node1 ~]# pcs quorum status
>>> > Quorum information
>>> > --
>>> > Date: Wed Aug 30 11:20:04 2023
>>> > Quorum provider:  corosync_votequorum
>>> > Nodes:3
>>> > Node ID:  1
>>> > Ring ID:  1/1538
>>> > Quorate:  Yes
>>> >
>>> > Votequorum information
>>> > --
>>> > Expected votes:   3
>>> > Highest expected: 3
>>> > Total votes:  3
>>> > Quorum:   2
>>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>>> >
>>> > Membership information
>>> > --
>>> > Nodeid  VotesQdevice Name
>>> >  1  1 NR node1 (local)
>>> >  2  1 NR node2
>>> >  3  1 NR node3
>>> >
>>> > If I stop the cluster services on node 2 and 3, the groups all
>>> failover to
>>> > node 1 since it is the node with the lowest ID
>>> > But if I stop them on node1 and node 2 or node1 and node3, the
>>> cluster
>>> > fails.
>>> >
>>> > I tried 

Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Klaus Wenninger
On Mon, Sep 4, 2023 at 12:45 PM David Dolan  wrote:

> Hi Klaus,
>
> With default quorum options I've performed the following on my 3 node
> cluster
>
> Bring down cluster services on one node - the running services migrate to
> another node
> Wait 3 minutes
> Bring down cluster services on one of the two remaining nodes - the
> surviving node in the cluster is then fenced
>
> Instead of the surviving node being fenced, I hoped that the services
> would migrate and run on that remaining node.
>
> Just looking for confirmation that my understanding is ok and if I'm
> missing something?
>

As said I've never used it ...
Well when down to 2 nodes LMS per definition is getting into trouble as
after another
outage any of them is gonna be alone. In case of an ordered shutdown this
could
possibly be circumvented though. So I guess your first attempt to enable
auto-tie-breaker
was the right idea. Like this you will have further service at least on one
of the nodes.
So I guess what you were seeing is the right - and unfortunately only
possible - behavior.
Where LMS shines is probably scenarios with substantially more nodes.
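
To make that concrete with rough numbers (as I understand LMS,
expected_votes is lowered after last_man_standing_window as long as the
cluster stays quorate): with 5 nodes quorum starts at 3; lose one node and
expected_votes drops to 4, quorum stays 3; lose another and expected_votes
drops to 3, quorum becomes 2; so LMS alone can walk a 5-node cluster down
to 2 nodes, and only the final 2-to-1 step needs auto_tie_breaker. With
only 3 nodes there is a single step before hitting that 2-node tie, which
is exactly the case discussed here.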

Klaus

>
> Thanks
> David
>
>
>
> On Thu, 31 Aug 2023 at 11:59, David Dolan  wrote:
>
>> I just tried removing all the quorum options setting back to defaults so
>> no last_man_standing or wait_for_all.
>> I still see the same behaviour where the third node is fenced if I bring
>> down services on two nodes.
>> Thanks
>> David
>>
>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger 
>> wrote:
>>
>>>
>>>
>>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan 
>>> wrote:
>>>


 On Wed, 30 Aug 2023 at 17:35, David Dolan 
 wrote:

>
>
> > Hi All,
>> >
>> > I'm running Pacemaker on Centos7
>> > Name: pcs
>> > Version : 0.9.169
>> > Release : 3.el7.centos.3
>> > Architecture: x86_64
>> >
>> >
>> Besides the pcs-version versions of the other cluster-stack-components
>> could be interesting. (pacemaker, corosync)
>>
>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
> corosynclib-2.4.5-7.el7_9.2.x86_64
> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
> fence-agents-common-4.2.1-41.el7_9.6.x86_64
> corosync-2.4.5-7.el7_9.2.x86_64
> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
> pacemaker-1.1.23-1.el7_9.1.x86_64
> pcs-0.9.169-3.el7.centos.3.x86_64
> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>
>>
>>
>> > I'm performing some cluster failover tests in a 3 node cluster. We
>> have 3
>> > resources in the cluster.
>> > I was trying to see if I could get it working if 2 nodes fail at
>> different
>> > times. I'd like the 3 resources to then run on one node.
>> >
>> > The quorum options I've configured are as follows
>> > [root@node1 ~]# pcs quorum config
>> > Options:
>> >   auto_tie_breaker: 1
>> >   last_man_standing: 1
>> >   last_man_standing_window: 1
>> >   wait_for_all: 1
>> >
>> >
>> Not sure if the combination of auto_tie_breaker and last_man_standing
>> makes
>> sense.
>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>> should be
>> disabled anyway I guess.
>>
> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>
>>
>>
>> > [root@node1 ~]# pcs quorum status
>> > Quorum information
>> > --
>> > Date: Wed Aug 30 11:20:04 2023
>> > Quorum provider:  corosync_votequorum
>> > Nodes:3
>> > Node ID:  1
>> > Ring ID:  1/1538
>> > Quorate:  Yes
>> >
>> > Votequorum information
>> > --
>> > Expected votes:   3
>> > Highest expected: 3
>> > Total votes:  3
>> > Quorum:   2
>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>> >
>> > Membership information
>> > --
>> > Nodeid  VotesQdevice Name
>> >  1  1 NR node1 (local)
>> >  2  1 NR node2
>> >  3  1 NR node3
>> >
>> > If I stop the cluster services on node 2 and 3, the groups all
>> failover to
>> > node 1 since it is the node with the lowest ID
>> > But if I stop them on node1 and node 2 or node1 and node3, the
>> cluster
>> > fails.
>> >
>> > I tried adding this line to corosync.conf and I could then bring
>> down the
>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
>> last,
>> > the cluster failed
>> > auto_tie_breaker_node: 1  3
>> >
>> > This line had the same outcome as using 1 3
>> > auto_tie_breaker_node: 1  2 3
>> >
>> >
>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but
>> rather

Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread Andrei Borzenkov
On Mon, Sep 4, 2023 at 1:45 PM David Dolan  wrote:
>
> Hi Klaus,
>
> With default quorum options I've performed the following on my 3 node cluster
>
> Bring down cluster services on one node - the running services migrate to 
> another node
> Wait 3 minutes
> Bring down cluster services on one of the two remaining nodes - the surviving 
> node in the cluster is then fenced
>

Is it fenced or is it reset? It is not the same.

The default for no-quorum-policy is "stop". So you either have
"no-quorum-policy" set to "suicide", or node is reset by something
outside of pacemaker. This "something" may initiate fencing too.
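
A quick check of what is actually in effect (sketch):

pcs property list --all | grep -E "no-quorum-policy|stonith"
pcs stonith show --full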

> Instead of the surviving node being fenced, I hoped that the services would 
> migrate and run on that remaining node.
>
> Just looking for confirmation that my understanding is ok and if I'm missing 
> something?
>
> Thanks
> David
>
>
>
> On Thu, 31 Aug 2023 at 11:59, David Dolan  wrote:
>>
>> I just tried removing all the quorum options setting back to defaults so no 
>> last_man_standing or wait_for_all.
>> I still see the same behaviour where the third node is fenced if I bring 
>> down services on two nodes.
>> Thanks
>> David
>>
>> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger  wrote:
>>>
>>>
>>>
>>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan  wrote:



 On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:
>
>
>
>> > Hi All,
>> >
>> > I'm running Pacemaker on Centos7
>> > Name: pcs
>> > Version : 0.9.169
>> > Release : 3.el7.centos.3
>> > Architecture: x86_64
>> >
>> >
>> Besides the pcs-version versions of the other cluster-stack-components
>> could be interesting. (pacemaker, corosync)
>
>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
> corosynclib-2.4.5-7.el7_9.2.x86_64
> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
> fence-agents-common-4.2.1-41.el7_9.6.x86_64
> corosync-2.4.5-7.el7_9.2.x86_64
> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
> pacemaker-1.1.23-1.el7_9.1.x86_64
> pcs-0.9.169-3.el7.centos.3.x86_64
> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>
>>
>>
>> > I'm performing some cluster failover tests in a 3 node cluster. We 
>> > have 3
>> > resources in the cluster.
>> > I was trying to see if I could get it working if 2 nodes fail at 
>> > different
>> > times. I'd like the 3 resources to then run on one node.
>> >
>> > The quorum options I've configured are as follows
>> > [root@node1 ~]# pcs quorum config
>> > Options:
>> >   auto_tie_breaker: 1
>> >   last_man_standing: 1
>> >   last_man_standing_window: 1
>> >   wait_for_all: 1
>> >
>> >
>> Not sure if the combination of auto_tie_breaker and last_man_standing 
>> makes
>> sense.
>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>> should be
>> disabled anyway I guess.
>
> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>
>>
>>
>> > [root@node1 ~]# pcs quorum status
>> > Quorum information
>> > --
>> > Date: Wed Aug 30 11:20:04 2023
>> > Quorum provider:  corosync_votequorum
>> > Nodes:3
>> > Node ID:  1
>> > Ring ID:  1/1538
>> > Quorate:  Yes
>> >
>> > Votequorum information
>> > --
>> > Expected votes:   3
>> > Highest expected: 3
>> > Total votes:  3
>> > Quorum:   2
>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>> >
>> > Membership information
>> > --
>> > Nodeid  VotesQdevice Name
>> >  1  1 NR node1 (local)
>> >  2  1 NR node2
>> >  3  1 NR node3
>> >
>> > If I stop the cluster services on node 2 and 3, the groups all 
>> > failover to
>> > node 1 since it is the node with the lowest ID
>> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
>> > fails.
>> >
>> > I tried adding this line to corosync.conf and I could then bring down 
>> > the
>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until 
>> > last,
>> > the cluster failed
>> > auto_tie_breaker_node: 1  3
>> >
>> > This line had the same outcome as using 1 3
>> > auto_tie_breaker_node: 1  2 3
>> >
>> >
>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but 
>> rather
>> sounds dangerous if that configuration is possible at all.
>>
>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>> recognized) misconfiguration.
>> Did you wait long enough between letting the 2 nodes fail?
>
> I've done it so many times so I 

Re: [ClusterLabs] issue during Pacemaker failover testing

2023-09-04 Thread David Dolan
Hi Klaus,

With default quorum options I've performed the following on my 3 node
cluster

Bring down cluster services on one node - the running services migrate to
another node
Wait 3 minutes
Bring down cluster services on one of the two remaining nodes - the
surviving node in the cluster is then fenced
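
Roughly the same sequence expressed as commands (node names here are
placeholders):

pcs cluster stop node2      # first outage; resources move to the survivors
sleep 180                   # wait ~3 minutes
pcs cluster stop node3      # second outage
pcs status                  # watch what the last node does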

Instead of the surviving node being fenced, I hoped that the services would
migrate and run on that remaining node.

Just looking for confirmation that my understanding is ok and if I'm
missing something?

Thanks
David



On Thu, 31 Aug 2023 at 11:59, David Dolan  wrote:

> I just tried removing all the quorum options setting back to defaults so
> no last_man_standing or wait_for_all.
> I still see the same behaviour where the third node is fenced if I bring
> down services on two nodes.
> Thanks
> David
>
> On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger  wrote:
>
>>
>>
>> On Thu, Aug 31, 2023 at 12:28 PM David Dolan 
>> wrote:
>>
>>>
>>>
>>> On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:
>>>


 > Hi All,
> >
> > I'm running Pacemaker on Centos7
> > Name: pcs
> > Version : 0.9.169
> > Release : 3.el7.centos.3
> > Architecture: x86_64
> >
> >
> Besides the pcs-version versions of the other cluster-stack-components
> could be interesting. (pacemaker, corosync)
>
  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
 fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
 corosynclib-2.4.5-7.el7_9.2.x86_64
 pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
 fence-agents-common-4.2.1-41.el7_9.6.x86_64
 corosync-2.4.5-7.el7_9.2.x86_64
 pacemaker-cli-1.1.23-1.el7_9.1.x86_64
 pacemaker-1.1.23-1.el7_9.1.x86_64
 pcs-0.9.169-3.el7.centos.3.x86_64
 pacemaker-libs-1.1.23-1.el7_9.1.x86_64

>
>
> > I'm performing some cluster failover tests in a 3 node cluster. We
> have 3
> > resources in the cluster.
> > I was trying to see if I could get it working if 2 nodes fail at
> different
> > times. I'd like the 3 resources to then run on one node.
> >
> > The quorum options I've configured are as follows
> > [root@node1 ~]# pcs quorum config
> > Options:
> >   auto_tie_breaker: 1
> >   last_man_standing: 1
> >   last_man_standing_window: 1
> >   wait_for_all: 1
> >
> >
> Not sure if the combination of auto_tie_breaker and last_man_standing
> makes
> sense.
> And as you have a cluster with an odd number of nodes auto_tie_breaker
> should be
> disabled anyway I guess.
>
 Ah ok I'll try removing auto_tie_breaker and leave last_man_standing

>
>
> > [root@node1 ~]# pcs quorum status
> > Quorum information
> > --
> > Date: Wed Aug 30 11:20:04 2023
> > Quorum provider:  corosync_votequorum
> > Nodes:3
> > Node ID:  1
> > Ring ID:  1/1538
> > Quorate:  Yes
> >
> > Votequorum information
> > --
> > Expected votes:   3
> > Highest expected: 3
> > Total votes:  3
> > Quorum:   2
> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
> >
> > Membership information
> > --
> > Nodeid  VotesQdevice Name
> >  1  1 NR node1 (local)
> >  2  1 NR node2
> >  3  1 NR node3
> >
> > If I stop the cluster services on node 2 and 3, the groups all
> failover to
> > node 1 since it is the node with the lowest ID
> > But if I stop them on node1 and node 2 or node1 and node3, the
> cluster
> > fails.
> >
> > I tried adding this line to corosync.conf and I could then bring
> down the
> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
> last,
> > the cluster failed
> > auto_tie_breaker_node: 1  3
> >
> > This line had the same outcome as using 1 3
> > auto_tie_breaker_node: 1  2 3
> >
> >
> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but
> rather
> sounds dangerous if that configuration is possible at all.
>
> Maybe the misbehavior of last_man_standing is due to this (maybe not
> recognized) misconfiguration.
> Did you wait long enough between letting the 2 nodes fail?
>
 I've done it so many times so I believe so. But I'll try remove the
 auto_tie_breaker config, leaving the last_man_standing. I'll also make sure
 I leave a couple of minutes between bringing down the nodes and post back.

>>> Just confirming I removed the auto_tie_breaker config and tested. Quorum
>>> configuration is as follows:
>>>  Options:
>>>   last_man_standing: 1
>>>   last_man_standing_window: 1
>>>   wait_for_all: 1
>>>
>>> I waited 2-3 minutes between stopping cluster services 

Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-31 Thread David Dolan
I just tried removing all the quorum options, setting everything back to
defaults, so no last_man_standing or wait_for_all.
I still see the same behaviour where the third node is fenced if I bring
down services on two nodes.
Thanks
David

On Thu, 31 Aug 2023 at 11:44, Klaus Wenninger  wrote:

>
>
> On Thu, Aug 31, 2023 at 12:28 PM David Dolan 
> wrote:
>
>>
>>
>> On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:
>>
>>>
>>>
>>> > Hi All,
 >
 > I'm running Pacemaker on Centos7
 > Name: pcs
 > Version : 0.9.169
 > Release : 3.el7.centos.3
 > Architecture: x86_64
 >
 >
 Besides the pcs-version versions of the other cluster-stack-components
 could be interesting. (pacemaker, corosync)

>>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>>> corosynclib-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>>> corosync-2.4.5-7.el7_9.2.x86_64
>>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>>> pacemaker-1.1.23-1.el7_9.1.x86_64
>>> pcs-0.9.169-3.el7.centos.3.x86_64
>>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>>


 > I'm performing some cluster failover tests in a 3 node cluster. We
 have 3
 > resources in the cluster.
 > I was trying to see if I could get it working if 2 nodes fail at
 different
 > times. I'd like the 3 resources to then run on one node.
 >
 > The quorum options I've configured are as follows
 > [root@node1 ~]# pcs quorum config
 > Options:
 >   auto_tie_breaker: 1
 >   last_man_standing: 1
 >   last_man_standing_window: 1
 >   wait_for_all: 1
 >
 >
 Not sure if the combination of auto_tie_breaker and last_man_standing
 makes
 sense.
 And as you have a cluster with an odd number of nodes auto_tie_breaker
 should be
 disabled anyway I guess.

>>> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>>


 > [root@node1 ~]# pcs quorum status
 > Quorum information
 > --
 > Date: Wed Aug 30 11:20:04 2023
 > Quorum provider:  corosync_votequorum
 > Nodes:3
 > Node ID:  1
 > Ring ID:  1/1538
 > Quorate:  Yes
 >
 > Votequorum information
 > --
 > Expected votes:   3
 > Highest expected: 3
 > Total votes:  3
 > Quorum:   2
 > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
 >
 > Membership information
 > --
 > Nodeid  VotesQdevice Name
 >  1  1 NR node1 (local)
 >  2  1 NR node2
 >  3  1 NR node3
 >
 > If I stop the cluster services on node 2 and 3, the groups all
 failover to
 > node 1 since it is the node with the lowest ID
 > But if I stop them on node1 and node 2 or node1 and node3, the cluster
 > fails.
 >
 > I tried adding this line to corosync.conf and I could then bring down
 the
 > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
 last,
 > the cluster failed
 > auto_tie_breaker_node: 1  3
 >
 > This line had the same outcome as using 1 3
 > auto_tie_breaker_node: 1  2 3
 >
 >
 Giving multiple auto_tie_breaker-nodes doesn't make sense to me but
 rather
 sounds dangerous if that configuration is possible at all.

 Maybe the misbehavior of last_man_standing is due to this (maybe not
 recognized) misconfiguration.
 Did you wait long enough between letting the 2 nodes fail?

>>> I've done it so many times so I believe so. But I'll try remove the
>>> auto_tie_breaker config, leaving the last_man_standing. I'll also make sure
>>> I leave a couple of minutes between bringing down the nodes and post back.
>>>
>> Just confirming I removed the auto_tie_breaker config and tested. Quorum
>> configuration is as follows:
>>  Options:
>>   last_man_standing: 1
>>   last_man_standing_window: 1
>>   wait_for_all: 1
>>
>> I waited 2-3 minutes between stopping cluster services on two nodes via
>> pcs cluster stop
>> The remaining cluster node is then fenced. I was hoping the remaining
>> node would stay online running the resources.
>>
>
> Yep - that would've been my understanding as well.
> But honestly I've never used last_man_standing in this context - wasn't
> even aware that it was
> offered without qdevice nor have I checked how it is implemented.
>
> Klaus
>
>>
>>
 Klaus


 > So I'd like it to failover when any combination of two nodes fail but
 I've
 > only had success when the middle node isn't last.
 >
 > Thanks
 > David




Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-31 Thread Klaus Wenninger
On Thu, Aug 31, 2023 at 12:28 PM David Dolan  wrote:

>
>
> On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:
>
>>
>>
>> > Hi All,
>>> >
>>> > I'm running Pacemaker on Centos7
>>> > Name: pcs
>>> > Version : 0.9.169
>>> > Release : 3.el7.centos.3
>>> > Architecture: x86_64
>>> >
>>> >
>>> Besides the pcs-version versions of the other cluster-stack-components
>>> could be interesting. (pacemaker, corosync)
>>>
>>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
>> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
>> corosynclib-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
>> fence-agents-common-4.2.1-41.el7_9.6.x86_64
>> corosync-2.4.5-7.el7_9.2.x86_64
>> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
>> pacemaker-1.1.23-1.el7_9.1.x86_64
>> pcs-0.9.169-3.el7.centos.3.x86_64
>> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>>
>>>
>>>
>>> > I'm performing some cluster failover tests in a 3 node cluster. We
>>> have 3
>>> > resources in the cluster.
>>> > I was trying to see if I could get it working if 2 nodes fail at
>>> different
>>> > times. I'd like the 3 resources to then run on one node.
>>> >
>>> > The quorum options I've configured are as follows
>>> > [root@node1 ~]# pcs quorum config
>>> > Options:
>>> >   auto_tie_breaker: 1
>>> >   last_man_standing: 1
>>> >   last_man_standing_window: 1
>>> >   wait_for_all: 1
>>> >
>>> >
>>> Not sure if the combination of auto_tie_breaker and last_man_standing
>>> makes
>>> sense.
>>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>>> should be
>>> disabled anyway I guess.
>>>
>> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>>
>>>
>>>
>>> > [root@node1 ~]# pcs quorum status
>>> > Quorum information
>>> > --
>>> > Date: Wed Aug 30 11:20:04 2023
>>> > Quorum provider:  corosync_votequorum
>>> > Nodes:3
>>> > Node ID:  1
>>> > Ring ID:  1/1538
>>> > Quorate:  Yes
>>> >
>>> > Votequorum information
>>> > --
>>> > Expected votes:   3
>>> > Highest expected: 3
>>> > Total votes:  3
>>> > Quorum:   2
>>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>>> >
>>> > Membership information
>>> > --
>>> > Nodeid  VotesQdevice Name
>>> >  1  1 NR node1 (local)
>>> >  2  1 NR node2
>>> >  3  1 NR node3
>>> >
>>> > If I stop the cluster services on node 2 and 3, the groups all
>>> failover to
>>> > node 1 since it is the node with the lowest ID
>>> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
>>> > fails.
>>> >
>>> > I tried adding this line to corosync.conf and I could then bring down
>>> the
>>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
>>> last,
>>> > the cluster failed
>>> > auto_tie_breaker_node: 1  3
>>> >
>>> > This line had the same outcome as using 1 3
>>> > auto_tie_breaker_node: 1  2 3
>>> >
>>> >
>>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but
>>> rather
>>> sounds dangerous if that configuration is possible at all.
>>>
>>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>>> recognized) misconfiguration.
>>> Did you wait long enough between letting the 2 nodes fail?
>>>
>> I've done it so many times so I believe so. But I'll try remove the
>> auto_tie_breaker config, leaving the last_man_standing. I'll also make sure
>> I leave a couple of minutes between bringing down the nodes and post back.
>>
> Just confirming I removed the auto_tie_breaker config and tested. Quorum
> configuration is as follows:
>  Options:
>   last_man_standing: 1
>   last_man_standing_window: 1
>   wait_for_all: 1
>
> I waited 2-3 minutes between stopping cluster services on two nodes via
> pcs cluster stop
> The remaining cluster node is then fenced. I was hoping the remaining node
> would stay online running the resources.
>

Yep - that would've been my understanding as well.
But honestly I've never used last_man_standing in this context - wasn't
even aware that it was
offered without qdevice nor have I checked how it is implemented.
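
One way to see what votequorum is actually doing at runtime (sketch):

corosync-quorumtool -s
corosync-cmapctl | grep -i votequorum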

Klaus

>
>
>>> Klaus
>>>
>>>
>>> > So I'd like it to failover when any combination of two nodes fail but
>>> I've
>>> > only had success when the middle node isn't last.
>>> >
>>> > Thanks
>>> > David
>>>
>>>
>>>
>>>


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-31 Thread David Dolan
On Wed, 30 Aug 2023 at 17:35, David Dolan  wrote:

>
>
> > Hi All,
>> >
>> > I'm running Pacemaker on Centos7
>> > Name: pcs
>> > Version : 0.9.169
>> > Release : 3.el7.centos.3
>> > Architecture: x86_64
>> >
>> >
>> Besides the pcs-version versions of the other cluster-stack-components
>> could be interesting. (pacemaker, corosync)
>>
>  rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
> fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
> corosynclib-2.4.5-7.el7_9.2.x86_64
> pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
> fence-agents-common-4.2.1-41.el7_9.6.x86_64
> corosync-2.4.5-7.el7_9.2.x86_64
> pacemaker-cli-1.1.23-1.el7_9.1.x86_64
> pacemaker-1.1.23-1.el7_9.1.x86_64
> pcs-0.9.169-3.el7.centos.3.x86_64
> pacemaker-libs-1.1.23-1.el7_9.1.x86_64
>
>>
>>
>> > I'm performing some cluster failover tests in a 3 node cluster. We have
>> 3
>> > resources in the cluster.
>> > I was trying to see if I could get it working if 2 nodes fail at
>> different
>> > times. I'd like the 3 resources to then run on one node.
>> >
>> > The quorum options I've configured are as follows
>> > [root@node1 ~]# pcs quorum config
>> > Options:
>> >   auto_tie_breaker: 1
>> >   last_man_standing: 1
>> >   last_man_standing_window: 1
>> >   wait_for_all: 1
>> >
>> >
>> Not sure if the combination of auto_tie_breaker and last_man_standing
>> makes
>> sense.
>> And as you have a cluster with an odd number of nodes auto_tie_breaker
>> should be
>> disabled anyway I guess.
>>
> Ah ok I'll try removing auto_tie_breaker and leave last_man_standing
>
>>
>>
>> > [root@node1 ~]# pcs quorum status
>> > Quorum information
>> > --
>> > Date: Wed Aug 30 11:20:04 2023
>> > Quorum provider:  corosync_votequorum
>> > Nodes:3
>> > Node ID:  1
>> > Ring ID:  1/1538
>> > Quorate:  Yes
>> >
>> > Votequorum information
>> > --
>> > Expected votes:   3
>> > Highest expected: 3
>> > Total votes:  3
>> > Quorum:   2
>> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>> >
>> > Membership information
>> > --
>> > Nodeid  VotesQdevice Name
>> >  1  1 NR node1 (local)
>> >  2  1 NR node2
>> >  3  1 NR node3
>> >
>> > If I stop the cluster services on node 2 and 3, the groups all failover
>> to
>> > node 1 since it is the node with the lowest ID
>> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
>> > fails.
>> >
>> > I tried adding this line to corosync.conf and I could then bring down
>> the
>> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until
>> last,
>> > the cluster failed
>> > auto_tie_breaker_node: 1  3
>> >
>> > This line had the same outcome as using 1 3
>> > auto_tie_breaker_node: 1  2 3
>> >
>> >
>> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but rather
>> sounds dangerous if that configuration is possible at all.
>>
>> Maybe the misbehavior of last_man_standing is due to this (maybe not
>> recognized) misconfiguration.
>> Did you wait long enough between letting the 2 nodes fail?
>>
> I've done it so many times so I believe so. But I'll try remove the
> auto_tie_breaker config, leaving the last_man_standing. I'll also make sure
> I leave a couple of minutes between bringing down the nodes and post back.
>
Just confirming I removed the auto_tie_breaker config and tested. Quorum
configuration is as follows:
 Options:
  last_man_standing: 1
  last_man_standing_window: 1
  wait_for_all: 1

I waited 2-3 minutes between stopping cluster services on two nodes via pcs
cluster stop
The remaining cluster node is then fenced. I was hoping the remaining node
would stay online running the resources.


>> Klaus
>>
>>
>> > So I'd like it to failover when any combination of two nodes fail but
>> I've
>> > only had success when the middle node isn't last.
>> >
>> > Thanks
>> > David
>>
>>
>>
>>


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-30 Thread Andrei Borzenkov

On 30.08.2023 19:23, David Dolan wrote:

>> Use fencing. Quorum is not a replacement for fencing. With (reliable)
>> fencing you can simply run pacemaker with no-quorum-policy=ignore.
>>
>> The practical problem is that usually the last resort that will work
>> in all cases is SBD + suicide and SBD cannot work without quorum.
>
> Ah I forgot to mention I do have fencing setup, which connects to VMware
> Virtualcenter.
> Do you think it's safe to set that no-quorum-policy=ignore?


Fencing is always safe. Fencing guarantees that when nodes take over
resources of a missing node, the missing node is actually not running
any of these resources. Yes, if fencing fails the resources won't be taken
over, but usually that is better than possible corruption. Quorum is
entirely orthogonal to that. If your two nodes lost connection to the
third node, they will happily take over resources whether the third node
already stopped them or not.


If you actually mean "is it guaranteed that the surviving node will
always be able to take over resources from other nodes" - no, it depends
on network connectivity. If the connection to VC is lost (or if anything bad
happens during communication with VC, like somebody changed the password you
use), fencing will fail and resources won't be taken over.
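
For completeness, a sketch of such a fence_vmware_rest setup plus a manual
test (all names, credentials and option values are placeholders; check the
exact option names with pcs stonith describe fence_vmware_rest):

pcs stonith create vmfence fence_vmware_rest \
    ip=vcenter.example.com username=cluster-fence password=secret \
    ssl_insecure=1 pcmk_host_map="node1:node1-vm;node2:node2-vm;node3:node3-vm"

# confirm each node can actually be fenced before relying on it
pcs stonith fence node3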



Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-30 Thread David Dolan
> Hi All,
> >
> > I'm running Pacemaker on Centos7
> > Name: pcs
> > Version : 0.9.169
> > Release : 3.el7.centos.3
> > Architecture: x86_64
> >
> >
> Besides the pcs-version versions of the other cluster-stack-components
> could be interesting. (pacemaker, corosync)
>
 rpm -qa | egrep "pacemaker|pcs|corosync|fence-agents"
fence-agents-vmware-rest-4.2.1-41.el7_9.6.x86_64
corosynclib-2.4.5-7.el7_9.2.x86_64
pacemaker-cluster-libs-1.1.23-1.el7_9.1.x86_64
fence-agents-common-4.2.1-41.el7_9.6.x86_64
corosync-2.4.5-7.el7_9.2.x86_64
pacemaker-cli-1.1.23-1.el7_9.1.x86_64
pacemaker-1.1.23-1.el7_9.1.x86_64
pcs-0.9.169-3.el7.centos.3.x86_64
pacemaker-libs-1.1.23-1.el7_9.1.x86_64

>
>
> > I'm performing some cluster failover tests in a 3 node cluster. We have 3
> > resources in the cluster.
> > I was trying to see if I could get it working if 2 nodes fail at
> different
> > times. I'd like the 3 resources to then run on one node.
> >
> > The quorum options I've configured are as follows
> > [root@node1 ~]# pcs quorum config
> > Options:
> >   auto_tie_breaker: 1
> >   last_man_standing: 1
> >   last_man_standing_window: 1
> >   wait_for_all: 1
> >
> >
> Not sure if the combination of auto_tie_breaker and last_man_standing makes
> sense.
> And as you have a cluster with an odd number of nodes auto_tie_breaker
> should be
> disabled anyway I guess.
>
Ah ok I'll try removing auto_tie_breaker and leave last_man_standing

>
>
> > [root@node1 ~]# pcs quorum status
> > Quorum information
> > --
> > Date: Wed Aug 30 11:20:04 2023
> > Quorum provider:  corosync_votequorum
> > Nodes:3
> > Node ID:  1
> > Ring ID:  1/1538
> > Quorate:  Yes
> >
> > Votequorum information
> > --
> > Expected votes:   3
> > Highest expected: 3
> > Total votes:  3
> > Quorum:   2
> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
> >
> > Membership information
> > --
> > Nodeid  VotesQdevice Name
> >  1  1 NR node1 (local)
> >  2  1 NR node2
> >  3  1 NR node3
> >
> > If I stop the cluster services on node 2 and 3, the groups all failover
> to
> > node 1 since it is the node with the lowest ID
> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
> > fails.
> >
> > I tried adding this line to corosync.conf and I could then bring down the
> > services on node 1 and 2 or node 2 and 3 but if I left node 2 until last,
> > the cluster failed
> > auto_tie_breaker_node: 1  3
> >
> > This line had the same outcome as using 1 3
> > auto_tie_breaker_node: 1  2 3
> >
> >
> Giving multiple auto_tie_breaker-nodes doesn't make sense to me but rather
> sounds dangerous if that configuration is possible at all.
>
> Maybe the misbehavior of last_man_standing is due to this (maybe not
> recognized) misconfiguration.
> Did you wait long enough between letting the 2 nodes fail?
>
I've done it so many times so I believe so. But I'll try removing the
auto_tie_breaker config, leaving the last_man_standing. I'll also make sure
I leave a couple of minutes between bringing down the nodes and post back.

>
> Klaus
>
>
> > So I'd like it to failover when any combination of two nodes fail but
> I've
> > only had success when the middle node isn't last.
> >
> > Thanks
> > David
>
>
>
>


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-30 Thread David Dolan
>
> >
> > Hi All,
> >
> > I'm running Pacemaker on Centos7
> > Name: pcs
> > Version : 0.9.169
> > Release : 3.el7.centos.3
> > Architecture: x86_64
> >
> >
> > I'm performing some cluster failover tests in a 3 node cluster. We have
> 3 resources in the cluster.
> > I was trying to see if I could get it working if 2 nodes fail at
> different times. I'd like the 3 resources to then run on one node.
> >
> > The quorum options I've configured are as follows
> > [root@node1 ~]# pcs quorum config
> > Options:
> >   auto_tie_breaker: 1
> >   last_man_standing: 1
> >   last_man_standing_window: 1
> >   wait_for_all: 1
> >
> > [root@node1 ~]# pcs quorum status
> > Quorum information
> > --
> > Date: Wed Aug 30 11:20:04 2023
> > Quorum provider:  corosync_votequorum
> > Nodes:3
> > Node ID:  1
> > Ring ID:  1/1538
> > Quorate:  Yes
> >
> > Votequorum information
> > --
> > Expected votes:   3
> > Highest expected: 3
> > Total votes:  3
> > Quorum:   2
> > Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
> >
> > Membership information
> > --
> > Nodeid  VotesQdevice Name
> >  1  1 NR node1 (local)
> >  2  1 NR node2
> >  3  1 NR node3
> >
> > If I stop the cluster services on node 2 and 3, the groups all failover
> to node 1 since it is the node with the lowest ID
> > But if I stop them on node1 and node 2 or node1 and node3, the cluster
> fails.
> >
> > I tried adding this line to corosync.conf and I could then bring down
> the services on node 1 and 2 or node 2 and 3 but if I left node 2 until
> last, the cluster failed
> > auto_tie_breaker_node: 1  3
> >
> > This line had the same outcome as using 1 3
> > auto_tie_breaker_node: 1  2 3
> >
> > So I'd like it to failover when any combination of two nodes fail but
> I've only had success when the middle node isn't last.
> >
>
> Use fencing. Quorum is not a replacement for fencing. With (reliable)
> fencing you can simply run pacemaker with no-quorum-policy=ignore.
>
> The practical problem is that usually the last resort that will work
> in all cases is SBD + suicide and SBD cannot work without quorum.
>
Ah I forgot to mention I do have fencing setup, which connects to VMware
Virtualcenter.
Do you think it's safe to set that no-quorum-policy=ignore?
Thanks
David



Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-30 Thread Klaus Wenninger
On Wed, Aug 30, 2023 at 2:34 PM David Dolan  wrote:

> Hi All,
>
> I'm running Pacemaker on Centos7
> Name: pcs
> Version : 0.9.169
> Release : 3.el7.centos.3
> Architecture: x86_64
>
>
Besides the pcs version, the versions of the other cluster-stack components
(pacemaker, corosync) could be interesting.


> I'm performing some cluster failover tests in a 3 node cluster. We have 3
> resources in the cluster.
> I was trying to see if I could get it working if 2 nodes fail at different
> times. I'd like the 3 resources to then run on one node.
>
> The quorum options I've configured are as follows
> [root@node1 ~]# pcs quorum config
> Options:
>   auto_tie_breaker: 1
>   last_man_standing: 1
>   last_man_standing_window: 1
>   wait_for_all: 1
>
>
Not sure if the combination of auto_tie_breaker and last_man_standing makes
sense.
And as you have a cluster with an odd number of nodes auto_tie_breaker
should be
disabled anyway I guess.
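
If it helps, those votequorum flags can be toggled with pcs (sketch;
depending on the pcs version the cluster may need to be stopped for
pcs quorum update):

pcs quorum update auto_tie_breaker=0
pcs quorum config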


> [root@node1 ~]# pcs quorum status
> Quorum information
> --
> Date: Wed Aug 30 11:20:04 2023
> Quorum provider:  corosync_votequorum
> Nodes:3
> Node ID:  1
> Ring ID:  1/1538
> Quorate:  Yes
>
> Votequorum information
> --
> Expected votes:   3
> Highest expected: 3
> Total votes:  3
> Quorum:   2
> Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>
> Membership information
> --
> Nodeid  VotesQdevice Name
>  1  1 NR node1 (local)
>  2  1 NR node2
>  3  1 NR node3
>
> If I stop the cluster services on node 2 and 3, the groups all failover to
> node 1 since it is the node with the lowest ID
> But if I stop them on node1 and node 2 or node1 and node3, the cluster
> fails.
>
> I tried adding this line to corosync.conf and I could then bring down the
> services on node 1 and 2 or node 2 and 3 but if I left node 2 until last,
> the cluster failed
> auto_tie_breaker_node: 1  3
>
> This line had the same outcome as using 1 3
> auto_tie_breaker_node: 1  2 3
>
>
Giving multiple auto_tie_breaker-nodes doesn't make sense to me but rather
sounds dangerous if that configuration is possible at all.

Maybe the misbehavior of last_man_standing is due to this (maybe not
recognized) misconfiguration.
Did you wait long enough between letting the 2 nodes fail?

Klaus


> So I'd like it to failover when any combination of two nodes fail but I've
> only had success when the middle node isn't last.
>
> Thanks
> David


Re: [ClusterLabs] issue during Pacemaker failover testing

2023-08-30 Thread Andrei Borzenkov
On Wed, Aug 30, 2023 at 3:34 PM David Dolan  wrote:
>
> Hi All,
>
> I'm running Pacemaker on Centos7
> Name: pcs
> Version : 0.9.169
> Release : 3.el7.centos.3
> Architecture: x86_64
>
>
> I'm performing some cluster failover tests in a 3 node cluster. We have 3 
> resources in the cluster.
> I was trying to see if I could get it working if 2 nodes fail at different 
> times. I'd like the 3 resources to then run on one node.
>
> The quorum options I've configured are as follows
> [root@node1 ~]# pcs quorum config
> Options:
>   auto_tie_breaker: 1
>   last_man_standing: 1
>   last_man_standing_window: 1
>   wait_for_all: 1
>
> [root@node1 ~]# pcs quorum status
> Quorum information
> --
> Date: Wed Aug 30 11:20:04 2023
> Quorum provider:  corosync_votequorum
> Nodes:3
> Node ID:  1
> Ring ID:  1/1538
> Quorate:  Yes
>
> Votequorum information
> --
> Expected votes:   3
> Highest expected: 3
> Total votes:  3
> Quorum:   2
> Flags:Quorate WaitForAll LastManStanding AutoTieBreaker
>
> Membership information
> --
> Nodeid  VotesQdevice Name
>  1  1 NR node1 (local)
>  2  1 NR node2
>  3  1 NR node3
>
> If I stop the cluster services on node 2 and 3, the groups all failover to 
> node 1 since it is the node with the lowest ID
> But if I stop them on node1 and node 2 or node1 and node3, the cluster fails.
>
> I tried adding this line to corosync.conf and I could then bring down the 
> services on node 1 and 2 or node 2 and 3 but if I left node 2 until last, the 
> cluster failed
> auto_tie_breaker_node: 1  3
>
> This line had the same outcome as using 1 3
> auto_tie_breaker_node: 1  2 3
>
> So I'd like it to failover when any combination of two nodes fail but I've 
> only had success when the middle node isn't last.
>

Use fencing. Quorum is not a replacement for fencing. With (reliable)
fencing you can simply run pacemaker with no-quorum-policy=ignore.

The practical problem is that usually the last resort that will work
in all cases is SBD + suicide and SBD cannot work without quorum.


[ClusterLabs] issue during Pacemaker failover testing

2023-08-30 Thread David Dolan
Hi All,

I'm running Pacemaker on Centos7
Name: pcs
Version : 0.9.169
Release : 3.el7.centos.3
Architecture: x86_64


I'm performing some cluster failover tests in a 3 node cluster. We have 3
resources in the cluster.
I was trying to see if I could get it working if 2 nodes fail at different
times. I'd like the 3 resources to then run on one node.

The quorum options I've configured are as follows
[root@node1 ~]# pcs quorum config
Options:
  auto_tie_breaker: 1
  last_man_standing: 1
  last_man_standing_window: 1
  wait_for_all: 1

[root@node1 ~]# pcs quorum status
Quorum information
--
Date: Wed Aug 30 11:20:04 2023
Quorum provider:  corosync_votequorum
Nodes:3
Node ID:  1
Ring ID:  1/1538
Quorate:  Yes

Votequorum information
--
Expected votes:   3
Highest expected: 3
Total votes:  3
Quorum:   2
Flags:Quorate WaitForAll LastManStanding AutoTieBreaker

Membership information
--
Nodeid  VotesQdevice Name
 1  1 NR node1 (local)
 2  1 NR node2
 3  1 NR node3

If I stop the cluster services on node 2 and 3, the groups all failover to
node 1 since it is the node with the lowest ID
But if I stop them on node1 and node 2 or node1 and node3, the cluster
fails.

I tried adding this line to corosync.conf and I could then bring down the
services on node 1 and 2 or node 2 and 3 but if I left node 2 until last,
the cluster failed
auto_tie_breaker_node: 1  3

This line had the same outcome as using 1 3
auto_tie_breaker_node: 1  2 3

So I'd like it to failover when any combination of two nodes fail but I've
only had success when the middle node isn't last.

Thanks
David