Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Klaus Wenninger
On Wed, Jun 28, 2023 at 3:30 AM Priyanka Balotra <
priyanka.14balo...@gmail.com> wrote:

> I am using SLES 15 SP4. Is the no-quorum-policy still supported?
>
> Thanks
> Priyanka
>
> On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot  wrote:
>
>> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
>> > In this case stonith has been configured as a resource,
>> > primitive stonith-sbd stonith:external/sbd
>>
>
Then the error scenario you described looks like every node lost connection
to the shared storage. The nodes that rebooted probably self-fenced (suicided)
via the watchdog rather than reading the poison pill. And the quorate partition
stays alive because it is quorate, but since it can't see the shared storage it
can't verify that it was able to write the poison pill, which makes the other
nodes stay unclean.
But again just guessing ...
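If that is the scenario, a quick sanity check - just a sketch, assuming a
single shared SBD device at an example path (substitute the real device) -
would be to verify from each node that the SBD device is still readable:

  # dump the SBD header and list the message slots on the shared device
  sbd -d /dev/disk/by-id/example-sbd-disk dump
  sbd -d /dev/disk/by-id/example-sbd-disk list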


> >
>> > For it to function properly, the resource needs to be up, which
>> > is only possible if the system is quorate.
>>
>> Pacemaker can use a fence device even if its resource is not active.
>> The resource being active just allows Pacemaker to monitor the device
>> regularly.
>>
>> >
>> > Hence our requirement is to make the system quorate even if one Node
>> > of the cluster is up.
>> > Stonith will then take care of any split-brain scenarios.
>>
>> In that case it sounds like no-quorum-policy=ignore is actually what
>> you want.
>>
>
Still dangerous without something like wait_for_all - right?
With LMS I guess you should get the same effect even without having specified
it explicitly, though.
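For reference, a corosync.conf quorum section along these lines (values are
only an example) would combine last_man_standing with wait_for_all:

  quorum {
      provider: corosync_votequorum
      # require all nodes once at cluster start, then allow shrinking
      wait_for_all: 1
      last_man_standing: 1
      # window in milliseconds for recalculating expected_votes
      last_man_standing_window: 20000
  }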

Klaus


>
>> >
>> > Thanks
>> > Priyanka
>> >
>> > On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger 
>> > wrote:
>> > >
>> > > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <
>> > > arvidj...@gmail.com> wrote:
>> > > > On 27.06.2023 07:21, Priyanka Balotra wrote:
>> > > > > Hi Andrei,
>> > > > > After this state the system went through some more fencings and
>> > > > we saw the
>> > > > > following state:
>> > > > >
>> > > > > :~ # crm status
>> > > > > Cluster Summary:
>> > > > >* Stack: corosync
>> > > > >* Current DC: FILE-2 (version
>> > > > > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36)
>> > > > - partition
>> > > > > with quorum
>> > > >
>> > > > It says "partition with quorum" so what exactly is the problem?
>> > >
>> > > I guess the problem is that resources aren't being recovered on
>> > > the nodes in the quorate partition.
>> > > Reason for that is probably that - as Ken was already suggesting -
>> > > fencing isn't
>> > > working properly or fencing-devices used are simply inappropriate
>> > > for the
>> > > purpose (e.g. onboard IPMI).
>> > > The fact that a node is rebooting isn't enough. The node that
>> > > initiated fencing
>> > > has to know that it did actually work. But we're just guessing
>> > > here. Logs should
>> > > show what is actually going on.
>> > >
>> > > Klaus
>> > > > >* Last updated: Mon Jun 26 12:44:15 2023
>> > > > >* Last change:  Mon Jun 26 12:41:12 2023 by root via
>> > > > cibadmin on FILE-2
>> > > > >* 4 nodes configured
>> > > > >* 11 resource instances configured
>> > > > >
>> > > > > Node List:
>> > > > >* Node FILE-1: UNCLEAN (offline)
>> > > > >* Node FILE-4: UNCLEAN (offline)
>> > > > >* Online: [ FILE-2 ]
>> > > > >* Online: [ FILE-3 ]
>> > > > >
>> > > > > At this stage FILE-1 and FILE-4 were continuously getting
>> > > > fenced (we have
>> > > > > device based stonith configured but the resource was not up ) .
>> > > > > Two nodes were online and two were offline. So quorum wasn't
>> > > > attained
>> > > > > again.
>> > > > > 1)  For such a scenario we need help to be able to have one
>> > > > cluster live .
>> > > > > 2)  And in cases where only one node of the cluster is up and
>> > > > others are
>> > > > > down we need the resources and cluster to be up .
>> > > > >
>> > > > > Thanks
>> > > > > Priyanka
>> > > > >
>> > > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <
>> > > > arvidj...@gmail.com>
>> > > > > wrote:
>> > > > >
>> > > > >> On 26.06.2023 21:14, Priyanka Balotra wrote:
>> > > > >>> Hi All,
>> > > > >>> We are seeing an issue where we replaced no-quorum-
>> > > > policy=ignore with
>> > > > >> other
>> > > > >>> options in corosync.conf order to simulate the same behaviour
>> > > > :
>> > > > >>>
>> > > > >>>
>> > > > >>> * wait_for_all: 0*
>> > > > >>>
>> > > > >>> *last_man_standing: 1
>> > > > last_man_standing_window: 2*
>> > > > >>>
>> > > > >>> There was another property (auto-tie-breaker) tried but
>> > > > couldn't
>> > > > >> configure
>> > > > >>> it as crm did not recognise this property.
>> > > > >>>
>> > > > >>> But even after using these options, we are seeing that system
>> > > > is not
>> > > > >>> quorate if at least half of the nodes are not up.
>> > > > >>>
>> > > > >>> Some properties from crm config are as follows:
>> > > > >>>
>> > > > >>>
>> > > > >>>
>> > > > >>> *primitive stonith-sbd stonith:external/sbd \params

Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Priyanka Balotra
I am using SLES 15 SP4. Is the no-quorum-policy still supported?

Thanks
Priyanka

On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot  wrote:

> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
> > In this case stonith has been configured as a resource,
> > primitive stonith-sbd stonith:external/sbd
> >
> > For it to function properly, the resource needs to be up, which
> > is only possible if the system is quorate.
>
> Pacemaker can use a fence device even if its resource is not active.
> The resource being active just allows Pacemaker to monitor the device
> regularly.
>
> >
> > Hence our requirement is to make the system quorate even if one Node
> > of the cluster is up.
> > Stonith will then take care of any split-brain scenarios.
>
> In that case it sounds like no-quorum-policy=ignore is actually what
> you want.
>
> >
> > Thanks
> > Priyanka
> >
> > On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger 
> > wrote:
> > >
> > > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <
> > > arvidj...@gmail.com> wrote:
> > > > On 27.06.2023 07:21, Priyanka Balotra wrote:
> > > > > Hi Andrei,
> > > > > After this state the system went through some more fencings and
> > > > we saw the
> > > > > following state:
> > > > >
> > > > > :~ # crm status
> > > > > Cluster Summary:
> > > > >* Stack: corosync
> > > > >* Current DC: FILE-2 (version
> > > > > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36)
> > > > - partition
> > > > > with quorum
> > > >
> > > > It says "partition with quorum" so what exactly is the problem?
> > >
> > > I guess the problem is that resources aren't being recovered on
> > > the nodes in the quorate partition.
> > > Reason for that is probably that - as Ken was already suggesting -
> > > fencing isn't
> > > working properly or fencing-devices used are simply inappropriate
> > > for the
> > > purpose (e.g. onboard IPMI).
> > > The fact that a node is rebooting isn't enough. The node that
> > > initiated fencing
> > > has to know that it did actually work. But we're just guessing
> > > here. Logs should
> > > show what is actually going on.
> > >
> > > Klaus
> > > > >* Last updated: Mon Jun 26 12:44:15 2023
> > > > >* Last change:  Mon Jun 26 12:41:12 2023 by root via
> > > > cibadmin on FILE-2
> > > > >* 4 nodes configured
> > > > >* 11 resource instances configured
> > > > >
> > > > > Node List:
> > > > >* Node FILE-1: UNCLEAN (offline)
> > > > >* Node FILE-4: UNCLEAN (offline)
> > > > >* Online: [ FILE-2 ]
> > > > >* Online: [ FILE-3 ]
> > > > >
> > > > > At this stage FILE-1 and FILE-4 were continuously getting
> > > > fenced (we have
> > > > > device based stonith configured but the resource was not up ) .
> > > > > Two nodes were online and two were offline. So quorum wasn't
> > > > attained
> > > > > again.
> > > > > 1)  For such a scenario we need help to be able to have one
> > > > cluster live .
> > > > > 2)  And in cases where only one node of the cluster is up and
> > > > others are
> > > > > down we need the resources and cluster to be up .
> > > > >
> > > > > Thanks
> > > > > Priyanka
> > > > >
> > > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <
> > > > arvidj...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > > >>> Hi All,
> > > > >>> We are seeing an issue where we replaced no-quorum-
> > > > policy=ignore with
> > > > >> other
> > > > >>> options in corosync.conf order to simulate the same behaviour
> > > > :
> > > > >>>
> > > > >>>
> > > > >>> * wait_for_all: 0*
> > > > >>>
> > > > >>> *last_man_standing: 1
> > > > last_man_standing_window: 2*
> > > > >>>
> > > > >>> There was another property (auto-tie-breaker) tried but
> > > > couldn't
> > > > >> configure
> > > > >>> it as crm did not recognise this property.
> > > > >>>
> > > > >>> But even after using these options, we are seeing that system
> > > > is not
> > > > >>> quorate if at least half of the nodes are not up.
> > > > >>>
> > > > >>> Some properties from crm config are as follows:
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *primitive stonith-sbd stonith:external/sbd \params
> > > > >>> pcmk_delay_base=5s.*
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>>
> > > > >>> *.property cib-bootstrap-options: \have-watchdog=true
> > > > \
> > > > >>>
> > > > >> dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-
> > > > 2.1.2+20211124.ada5c3b36"
> > > > >>> \cluster-infrastructure=corosync \cluster-
> > > > name=FILE \
> > > > >>> stonith-enabled=true \stonith-timeout=172 \
> > > > >>> stonith-action=reboot \stop-all-resources=false \
> > > > >>> no-quorum-policy=ignorersc_defaults build-resource-defaults:
> > > > \
> > > > >>> 

Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Ken Gaillot
On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
> In this case stonith has been configured as a resource, 
> primitive stonith-sbd stonith:external/sbd
> 
> For it to function properly, the resource needs to be up, which
> is only possible if the system is quorate.

Pacemaker can use a fence device even if its resource is not active.
The resource being active just allows Pacemaker to monitor the device
regularly.

> 
> Hence our requirement is to make the system quorate even if one Node
> of the cluster is up.
> Stonith will then take care of any split-brain scenarios. 

In that case it sounds like no-quorum-policy=ignore is actually what
you want.
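For example, with crmsh that would be something like (pcs or crm_attribute
can set the same cluster property):

  crm configure property no-quorum-policy=ignore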

> 
> Thanks
> Priyanka
> 
> On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger 
> wrote:
> > 
> > On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov <
> > arvidj...@gmail.com> wrote:
> > > On 27.06.2023 07:21, Priyanka Balotra wrote:
> > > > Hi Andrei,
> > > > After this state the system went through some more fencings and
> > > we saw the
> > > > following state:
> > > > 
> > > > :~ # crm status
> > > > Cluster Summary:
> > > >* Stack: corosync
> > > >* Current DC: FILE-2 (version
> > > > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36)
> > > - partition
> > > > with quorum
> > > 
> > > It says "partition with quorum" so what exactly is the problem?
> > 
> > I guess the problem is that resources aren't being recovered on
> > the nodes in the quorate partition.
> > Reason for that is probably that - as Ken was already suggesting -
> > fencing isn't
> > working properly or fencing-devices used are simply inappropriate
> > for the 
> > purpose (e.g. onboard IPMI).
> > The fact that a node is rebooting isn't enough. The node that
> > initiated fencing
> > has to know that it did actually work. But we're just guessing
> > here. Logs should
> > show what is actually going on.
> > 
> > Klaus
> > > >* Last updated: Mon Jun 26 12:44:15 2023
> > > >* Last change:  Mon Jun 26 12:41:12 2023 by root via
> > > cibadmin on FILE-2
> > > >* 4 nodes configured
> > > >* 11 resource instances configured
> > > > 
> > > > Node List:
> > > >* Node FILE-1: UNCLEAN (offline)
> > > >* Node FILE-4: UNCLEAN (offline)
> > > >* Online: [ FILE-2 ]
> > > >* Online: [ FILE-3 ]
> > > > 
> > > > At this stage FILE-1 and FILE-4 were continuously getting
> > > fenced (we have
> > > > device based stonith configured but the resource was not up ) .
> > > > Two nodes were online and two were offline. So quorum wasn't
> > > attained
> > > > again.
> > > > 1)  For such a scenario we need help to be able to have one
> > > cluster live .
> > > > 2)  And in cases where only one node of the cluster is up and
> > > others are
> > > > down we need the resources and cluster to be up .
> > > > 
> > > > Thanks
> > > > Priyanka
> > > > 
> > > > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <
> > > arvidj...@gmail.com>
> > > > wrote:
> > > > 
> > > >> On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > >>> Hi All,
> > > >>> We are seeing an issue where we replaced no-quorum-
> > > policy=ignore with
> > > >> other
> > > >>> options in corosync.conf order to simulate the same behaviour
> > > :
> > > >>>
> > > >>>
> > > >>> * wait_for_all: 0*
> > > >>>
> > > >>> *last_man_standing: 1   
> > > last_man_standing_window: 2*
> > > >>>
> > > >>> There was another property (auto-tie-breaker) tried but
> > > couldn't
> > > >> configure
> > > >>> it as crm did not recognise this property.
> > > >>>
> > > >>> But even after using these options, we are seeing that system
> > > is not
> > > >>> quorate if at least half of the nodes are not up.
> > > >>>
> > > >>> Some properties from crm config are as follows:
> > > >>>
> > > >>>
> > > >>>
> > > >>> *primitive stonith-sbd stonith:external/sbd \params
> > > >>> pcmk_delay_base=5s.*
> > > >>>
> > > >>>
> > > >>> *.property cib-bootstrap-options: \have-watchdog=true 
> > > \
> > > >>>
> > > >> dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-
> > > 2.1.2+20211124.ada5c3b36"
> > > >>> \cluster-infrastructure=corosync \cluster-
> > > name=FILE \
> > > >>> stonith-enabled=true \stonith-timeout=172 \
> > > >>> stonith-action=reboot \stop-all-resources=false \
> > > >>> no-quorum-policy=ignorersc_defaults build-resource-defaults:
> > > \
> > > >>> resource-stickiness=1rsc_defaults rsc-options: \
> > > >>> resource-stickiness=100 \migration-threshold=3 \
> > > >>> failure-timeout=1m \cluster-recheck-
> > > interval=10minop_defaults
> > > >>> op-options: \timeout=600 \record-
> > > pending=true*
> > > >>>
> > > >>> On a 4-node setup when the whole cluster is brought up
> > > together we see
> > > >>> error logs like:
> > > >>>
> > > >>> 

Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Priyanka Balotra
In this case stonith has been configured as a resource,
primitive stonith-sbd stonith:external/sbd

For it to function properly, the resource needs to be up, which is
only possible if the system is quorate.
Hence our requirement is to make the system quorate even if only one node of
the cluster is up.
Stonith will then take care of any split-brain scenarios.

Thanks
Priyanka

On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger  wrote:

>
>
> On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov 
> wrote:
>
>> On 27.06.2023 07:21, Priyanka Balotra wrote:
>> > Hi Andrei,
>> > After this state the system went through some more fencings and we saw
>> the
>> > following state:
>> >
>> > :~ # crm status
>> > Cluster Summary:
>> >* Stack: corosync
>> >* Current DC: FILE-2 (version
>> > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) -
>> partition
>> > with quorum
>>
>> It says "partition with quorum" so what exactly is the problem?
>>
>
> I guess the problem is that resources aren't being recovered on
> the nodes in the quorate partition.
> Reason for that is probably that - as Ken was already suggesting - fencing
> isn't
> working properly or fencing-devices used are simply inappropriate for the
> purpose (e.g. onboard IPMI).
> The fact that a node is rebooting isn't enough. The node that initiated
> fencing
> has to know that it did actually work. But we're just guessing here. Logs
> should
> show what is actually going on.
>
> Klaus
>
>>
>> >* Last updated: Mon Jun 26 12:44:15 2023
>> >* Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on
>> FILE-2
>> >* 4 nodes configured
>> >* 11 resource instances configured
>> >
>> > Node List:
>> >* Node FILE-1: UNCLEAN (offline)
>> >* Node FILE-4: UNCLEAN (offline)
>> >* Online: [ FILE-2 ]
>> >* Online: [ FILE-3 ]
>> >
>> > At this stage FILE-1 and FILE-4 were continuously getting fenced (we
>> have
>> > device based stonith configured but the resource was not up ) .
>> > Two nodes were online and two were offline. So quorum wasn't attained
>> > again.
>> > 1)  For such a scenario we need help to be able to have one cluster
>> live .
>> > 2)  And in cases where only one node of the cluster is up and others are
>> > down we need the resources and cluster to be up .
>> >
>> > Thanks
>> > Priyanka
>> >
>> > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov 
>> > wrote:
>> >
>> >> On 26.06.2023 21:14, Priyanka Balotra wrote:
>> >>> Hi All,
>> >>> We are seeing an issue where we replaced no-quorum-policy=ignore with
>> >> other
>> >>> options in corosync.conf order to simulate the same behaviour :
>> >>>
>> >>>
>> >>> * wait_for_all: 0*
>> >>>
>> >>> *last_man_standing: 1last_man_standing_window: 2*
>> >>>
>> >>> There was another property (auto-tie-breaker) tried but couldn't
>> >> configure
>> >>> it as crm did not recognise this property.
>> >>>
>> >>> But even after using these options, we are seeing that system is not
>> >>> quorate if at least half of the nodes are not up.
>> >>>
>> >>> Some properties from crm config are as follows:
>> >>>
>> >>>
>> >>>
>> >>> *primitive stonith-sbd stonith:external/sbd \params
>> >>> pcmk_delay_base=5s.*
>> >>>
>> >>> *.property cib-bootstrap-options: \have-watchdog=true \
>> >>>
>> >>
>> dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36"
>> >>> \cluster-infrastructure=corosync \cluster-name=FILE \
>> >>> stonith-enabled=true \stonith-timeout=172 \
>> >>> stonith-action=reboot \stop-all-resources=false \
>> >>> no-quorum-policy=ignorersc_defaults build-resource-defaults: \
>> >>> resource-stickiness=1rsc_defaults rsc-options: \
>> >>> resource-stickiness=100 \migration-threshold=3 \
>> >>> failure-timeout=1m \cluster-recheck-interval=10minop_defaults
>> >>> op-options: \timeout=600 \record-pending=true*
>> >>>
>> >>> On a 4-node setup when the whole cluster is brought up together we see
>> >>> error logs like:
>> >>>
>> >>> *2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Fencing and resource management disabled due to lack of
>> quorum*
>> >>>
>> >>> *2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Ignoring malformed node_state entry without uname*
>> >>>
>> >>> *2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Node FILE-2 is unclean!*
>> >>>
>> >>> *2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Node FILE-3 is unclean!*
>> >>>
>> >>> *2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
>> >>> warning: Node FILE-4 is unclean!*
>> >>>
>> >>
>> >> According to this output FILE-1 lost connection to three other nodes, in
>> >> which case it cannot be quorate.

Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Klaus Wenninger
On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov 
wrote:

> On 27.06.2023 07:21, Priyanka Balotra wrote:
> > Hi Andrei,
> > After this state the system went through some more fencings and we saw
> the
> > following state:
> >
> > :~ # crm status
> > Cluster Summary:
> >* Stack: corosync
> >* Current DC: FILE-2 (version
> > 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) -
> partition
> > with quorum
>
> It says "partition with quorum" so what exactly is the problem?
>

I guess the problem is that resources aren't being recovered on
the nodes in the quorate partition.
The reason for that is probably that - as Ken was already suggesting - fencing
isn't working properly, or the fencing devices used are simply inappropriate
for the purpose (e.g. onboard IPMI).
The fact that a node is rebooting isn't enough: the node that initiated
fencing has to know that it actually worked. But we're just guessing here.
Logs should show what is actually going on.
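One way to verify that (a sketch - pick a node you can afford to reboot) is
to trigger fencing manually and then check that the cluster recorded a
successful result:

  # ask the cluster to fence FILE-4 and wait for the outcome
  stonith_admin --reboot FILE-4
  # show the fencing history for that node, including success/failure
  stonith_admin --history FILE-4 --verbose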

Klaus

>
> >* Last updated: Mon Jun 26 12:44:15 2023
> >* Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on
> FILE-2
> >* 4 nodes configured
> >* 11 resource instances configured
> >
> > Node List:
> >* Node FILE-1: UNCLEAN (offline)
> >* Node FILE-4: UNCLEAN (offline)
> >* Online: [ FILE-2 ]
> >* Online: [ FILE-3 ]
> >
> > At this stage FILE-1 and FILE-4 were continuously getting fenced (we have
> > device based stonith configured but the resource was not up ) .
> > Two nodes were online and two were offline. So quorum wasn't attained
> > again.
> > 1)  For such a scenario we need help to be able to have one cluster live
> .
> > 2)  And in cases where only one node of the cluster is up and others are
> > down we need the resources and cluster to be up .
> >
> > Thanks
> > Priyanka
> >
> > On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov 
> > wrote:
> >
> >> On 26.06.2023 21:14, Priyanka Balotra wrote:
> >>> Hi All,
> >>> We are seeing an issue where we replaced no-quorum-policy=ignore with
> >> other
> >>> options in corosync.conf order to simulate the same behaviour :
> >>>
> >>>
> >>> * wait_for_all: 0*
> >>>
> >>> *last_man_standing: 1last_man_standing_window: 2*
> >>>
> >>> There was another property (auto-tie-breaker) tried but couldn't
> >> configure
> >>> it as crm did not recognise this property.
> >>>
> >>> But even after using these options, we are seeing that system is not
> >>> quorate if at least half of the nodes are not up.
> >>>
> >>> Some properties from crm config are as follows:
> >>>
> >>>
> >>>
> >>> *primitive stonith-sbd stonith:external/sbd \params
> >>> pcmk_delay_base=5s.*
> >>>
> >>>
> >>> *.property cib-bootstrap-options: \have-watchdog=true \
> >>>
> >>
> dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36"
> >>> \cluster-infrastructure=corosync \cluster-name=FILE \
> >>> stonith-enabled=true \stonith-timeout=172 \
> >>> stonith-action=reboot \stop-all-resources=false \
> >>> no-quorum-policy=ignorersc_defaults build-resource-defaults: \
> >>> resource-stickiness=1rsc_defaults rsc-options: \
> >>> resource-stickiness=100 \migration-threshold=3 \
> >>> failure-timeout=1m \cluster-recheck-interval=10minop_defaults
> >>> op-options: \timeout=600 \record-pending=true*
> >>>
> >>> On a 4-node setup when the whole cluster is brought up together we see
> >>> error logs like:
> >>>
> >>> *2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Fencing and resource management disabled due to lack of
> quorum*
> >>>
> >>> *2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Ignoring malformed node_state entry without uname*
> >>>
> >>> *2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Node FILE-2 is unclean!*
> >>>
> >>> *2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Node FILE-3 is unclean!*
> >>>
> >>> *2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
> >>> warning: Node FILE-4 is unclean!*
> >>>
> >>
> >> According to this output FILE-1 lost connection to three other nodes, in
> >> which case it cannot be quorate.
> >>
> >>>
> >>> Kindly help correct the configuration to make the system function
> >> normally
> >>> with all resources up, even if there is just one node up.
> >>>
> >>> Please let me know if any more info is needed.
> >>>
> >>> Thanks
> >>> Priyanka
> >>>
> >>>

Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Andrei Borzenkov

On 27.06.2023 07:21, Priyanka Balotra wrote:

Hi Andrei,
After this state the system went through some more fencings and we saw the
following state:

:~ # crm status
Cluster Summary:
   * Stack: corosync
   * Current DC: FILE-2 (version
2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition
with quorum


It says "partition with quorum" so what exactly is the problem?


   * Last updated: Mon Jun 26 12:44:15 2023
   * Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
   * 4 nodes configured
   * 11 resource instances configured

Node List:
   * Node FILE-1: UNCLEAN (offline)
   * Node FILE-4: UNCLEAN (offline)
   * Online: [ FILE-2 ]
   * Online: [ FILE-3 ]

At this stage FILE-1 and FILE-4 were continuously getting fenced (we have
device based stonith configured but the resource was not up ) .
Two nodes were online and two were offline. So quorum wasn't attained
again.
1)  For such a scenario we need help to be able to have one cluster live .
2)  And in cases where only one node of the cluster is up and others are
down we need the resources and cluster to be up .

Thanks
Priyanka

On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov 
wrote:


On 26.06.2023 21:14, Priyanka Balotra wrote:

Hi All,
We are seeing an issue where we replaced no-quorum-policy=ignore with other
options in corosync.conf in order to simulate the same behaviour:


wait_for_all: 0
last_man_standing: 1
last_man_standing_window: 2

There was another property (auto-tie-breaker) we tried, but we couldn't
configure it, as crm did not recognise this property.

But even after using these options, we are seeing that the system is not
quorate if at least half of the nodes are not up.

Some properties from crm config are as follows:



primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_base=5s

...

property cib-bootstrap-options: \
    have-watchdog=true \
    dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
    cluster-infrastructure=corosync \
    cluster-name=FILE \
    stonith-enabled=true \
    stonith-timeout=172 \
    stonith-action=reboot \
    stop-all-resources=false \
    no-quorum-policy=ignore
rsc_defaults build-resource-defaults: \
    resource-stickiness=1
rsc_defaults rsc-options: \
    resource-stickiness=100 \
    migration-threshold=3 \
    failure-timeout=1m \
    cluster-recheck-interval=10min
op_defaults op-options: \
    timeout=600 \
    record-pending=true

On a 4-node setup when the whole cluster is brought up together we see
error logs like:

*2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Fencing and resource management disabled due to lack of quorum*

*2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Ignoring malformed node_state entry without uname*

*2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Node FILE-2 is unclean!*

*2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Node FILE-3 is unclean!*

*2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
warning: Node FILE-4 is unclean!*



According to this output FILE-1 lost connection to three other nodes, in
which case it cannot be quorate.
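The membership and quorum state corosync itself computes can be checked on
each node with e.g.:

  corosync-quorumtool -s

which shows the expected votes, total votes and whether the partition is
quorate.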



Kindly help correct the configuration to make the system function

normally

with all resources up, even if there is just one node up.

Please let me know if any more info is needed.

Thanks
Priyanka


___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/


Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated ) and replaced with other options but not an effective solution

2023-06-27 Thread Ken Gaillot
On Tue, 2023-06-27 at 09:51 +0530, Priyanka Balotra wrote:
> Hi Andrei, 
> After this state the system went through some more fencings and we
> saw the following state: 
> 
> :~ # crm status
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-
> 2.1.2+20211124.ada5c3b36) - partition with quorum
>   * Last updated: Mon Jun 26 12:44:15 2023
>   * Last change:  Mon Jun 26 12:41:12 2023 by root via cibadmin on
> FILE-2
>   * 4 nodes configured
>   * 11 resource instances configured
> 
> Node List:
>   * Node FILE-1: UNCLEAN (offline)
>   * Node FILE-4: UNCLEAN (offline)
>   * Online: [ FILE-2 ]
>   * Online: [ FILE-3 ]
> 
> At this stage FILE-1 and FILE-4 were continuously getting fenced (we
> have device based stonith configured but the resource was not up ) . 
> Two nodes were online and two were offline. So quorum wasn't attained
> again. 
> 1)  For such a scenario we need help to be able to have one cluster
> live . 
> 2)  And in cases where only one node of the cluster is up and others
> are down we need the resources and cluster to be up . 

The solution is to fix the fencing.

Without fencing, there is no way to know that the other nodes are
*actually* offline. It's possible that communication between the nodes
has been temporarily interrupted, in which case recovering resources
could lead to a "split-brain" situation that could corrupt data or make
services unusable.

Onboard IPMI is not a production fencing mechanism by itself, because
it loses power when the node loses power. It's fine to use in a
topology with a fallback method such as power fencing or watchdog-based 
SBD.
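As a rough sketch (device names and timeouts are only examples, adjust to
the environment), watchdog-based SBD as a fallback needs sbd enabled on every
node plus the matching Pacemaker properties:

  # /etc/sysconfig/sbd (watchdog/diskless mode example)
  SBD_WATCHDOG_DEV=/dev/watchdog
  SBD_WATCHDOG_TIMEOUT=5
  SBD_STARTMODE=always

  # Pacemaker side
  crm configure property stonith-enabled=true
  crm configure property stonith-watchdog-timeout=10s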

> Thanks
> Priyanka
> 
> On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <
> arvidj...@gmail.com> wrote:
> > On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > Hi All,
> > > We are seeing an issue where we replaced no-quorum-policy=ignore
> > with other
> > > options in corosync.conf order to simulate the same behaviour :
> > > 
> > > 
> > > wait_for_all: 0
> > > last_man_standing: 1
> > > last_man_standing_window: 2
> > > 
> > > There was another property (auto-tie-breaker) tried but couldn't
> > configure
> > > it as crm did not recognise this property.
> > > 
> > > But even after using these options, we are seeing that system is
> > not
> > > quorate if at least half of the nodes are not up.
> > > 
> > > Some properties from crm config are as follows:
> > > 
> > > 
> > > 
> > > primitive stonith-sbd stonith:external/sbd \
> > >     params pcmk_delay_base=5s
> > > 
> > > ...
> > > 
> > > property cib-bootstrap-options: \
> > >     have-watchdog=true \
> > >     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
> > >     cluster-infrastructure=corosync \
> > >     cluster-name=FILE \
> > >     stonith-enabled=true \
> > >     stonith-timeout=172 \
> > >     stonith-action=reboot \
> > >     stop-all-resources=false \
> > >     no-quorum-policy=ignore
> > > rsc_defaults build-resource-defaults: \
> > >     resource-stickiness=1
> > > rsc_defaults rsc-options: \
> > >     resource-stickiness=100 \
> > >     migration-threshold=3 \
> > >     failure-timeout=1m \
> > >     cluster-recheck-interval=10min
> > > op_defaults op-options: \
> > >     timeout=600 \
> > >     record-pending=true
> > > 
> > > On a 4-node setup when the whole cluster is brought up together
> > we see
> > > error logs like:
> > > 
> > > *2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-
> > schedulerd[26359]:
> > > warning: Fencing and resource management disabled due to lack of
> > quorum*
> > > 
> > > *2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-
> > schedulerd[26359]:
> > > warning: Ignoring malformed node_state entry without uname*
> > > 
> > > *2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-
> > schedulerd[26359]:
> > > warning: Node FILE-2 is unclean!*
> > > 
> > > *2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-
> > schedulerd[26359]:
> > > warning: Node FILE-3 is unclean!*
> > > 
> > > *2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-
> > schedulerd[26359]:
> > > warning: Node FILE-4 is unclean!*
> > > 
> > 
> > According to this output FILE-1 lost connection to three other
> > nodes, in 
> > which case it cannot be quorate.
> > 
> > > 
> > > Kindly help correct the configuration to make the system function
> > normally
> > > with all resources up, even if there is just one node up.
> > > 
> > > Please let me know if any more info is needed.
> > > 
> > > Thanks
> > > Priyanka
> > > 
> > > 