Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated) and replaced with other options but not an effective solution

On Wed, Jun 28, 2023 at 3:30 AM Priyanka Balotra <priyanka.14balo...@gmail.com> wrote:
> I am using SLES 15 SP4. Is the no-quorum-policy still supported?
>
> Thanks
> Priyanka
>
> On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot wrote:
>> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
>>> In this case stonith has been configured as a resource,
>>> primitive stonith-sbd stonith:external/sbd

Then the error scenario you described looks like everybody lost
connection to the shared storage. The nodes that rebooted then probably
suicided rather than reading the poison pill. And the quorate partition
is staying alive because it is quorate, but without seeing the shared
storage it can't verify that it was able to write the poison pill, which
makes the other nodes stay unclean. But again, just guessing ...

>>> For it to function properly, the resource needs to be up, which is
>>> only possible if the system is quorate.
>>
>> Pacemaker can use a fence device even if its resource is not active.
>> The resource being active just allows Pacemaker to monitor the device
>> regularly.
>>
>>> Hence our requirement is to make the system quorate even if only one
>>> node of the cluster is up.
>>> Stonith will then take care of any split-brain scenarios.
>>
>> In that case it sounds like no-quorum-policy=ignore is actually what
>> you want.

Still dangerous without something like wait_for_all, right? With
last_man_standing I guess you should get the same effect even without
having specified it explicitly, though.

Klaus
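For reference, the options being discussed here (wait_for_all, last_man_standing, auto_tie_breaker) are corosync votequorum options that belong in the quorum {} section of corosync.conf, not in the Pacemaker CIB, which would explain why crm did not recognise auto-tie-breaker as a cluster property. A sketch of what that section might look like, with illustrative values (note that, per the votequorum man page as I recall it, last_man_standing_window is in milliseconds, so the "2" used earlier in this thread would be far too short; the default is 10000):

```
quorum {
    provider: corosync_votequorum

    # Strongly recommended together with last_man_standing: a freshly
    # started partition waits until all nodes have been seen once
    # before it can become quorate.
    wait_for_all: 1

    # Recalculate expected_votes downward as nodes are lost gradually,
    # one window at a time, while the partition stays quorate.
    last_man_standing: 1
    last_man_standing_window: 10000

    # Needed (among other things) to allow downscaling from a 2-node
    # to a 1-node membership; configured here, not via crm.
    auto_tie_breaker: 1
}
```

Values here are a hedged example, not a tested configuration; they should be checked against votequorum(5) on the installed corosync version.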
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated) and replaced with other options but not an effective solution

I am using SLES 15 SP4. Is the no-quorum-policy still supported?

Thanks
Priyanka

On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot wrote:
> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
> > Hence our requirement is to make the system quorate even if only one
> > node of the cluster is up.
> > Stonith will then take care of any split-brain scenarios.
>
> In that case it sounds like no-quorum-policy=ignore is actually what
> you want.
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated) and replaced with other options but not an effective solution

On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
> In this case stonith has been configured as a resource,
> primitive stonith-sbd stonith:external/sbd
>
> For it to function properly, the resource needs to be up, which is
> only possible if the system is quorate.

Pacemaker can use a fence device even if its resource is not active.
The resource being active just allows Pacemaker to monitor the device
regularly.

> Hence our requirement is to make the system quorate even if only one
> node of the cluster is up.
> Stonith will then take care of any split-brain scenarios.

In that case it sounds like no-quorum-policy=ignore is actually what
you want.

> Thanks
> Priyanka
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated) and replaced with other options but not an effective solution

In this case stonith has been configured as a resource:

    primitive stonith-sbd stonith:external/sbd

For it to function properly, the resource needs to be up, which is only
possible if the system is quorate. Hence our requirement is to make the
system quorate even if only one node of the cluster is up. Stonith will
then take care of any split-brain scenarios.

Thanks
Priyanka

On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger wrote:
> The fact that a node is rebooting isn't enough. The node that
> initiated fencing has to know that it did actually work. But we're
> just guessing here. Logs should show what is actually going on.
>
> Klaus
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated) and replaced with other options but not an effective solution

On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov wrote:
> On 27.06.2023 07:21, Priyanka Balotra wrote:
> > At this stage FILE-1 and FILE-4 were continuously getting fenced (we
> > have device-based stonith configured but the resource was not up).
>
> It says "partition with quorum" so what exactly is the problem?

I guess the problem is that resources aren't being recovered on the
nodes in the quorate partition. The reason for that is probably that,
as Ken was already suggesting, fencing isn't working properly, or the
fencing devices used are simply inappropriate for the purpose (e.g.
onboard IPMI). The fact that a node is rebooting isn't enough: the node
that initiated fencing has to know that it did actually work. But we're
just guessing here. Logs should show what is actually going on.

Klaus
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated) and replaced with other options but not an effective solution

On 27.06.2023 07:21, Priyanka Balotra wrote:
> Hi Andrei,
> After this state the system went through some more fencings and we saw
> the following state:
>
> :~ # crm status
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: FILE-2 (version
>     2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) -
>     partition with quorum

It says "partition with quorum" so what exactly is the problem?

>   * Last updated: Mon Jun 26 12:44:15 2023
>   * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
>   * 4 nodes configured
>   * 11 resource instances configured
>
> Node List:
>   * Node FILE-1: UNCLEAN (offline)
>   * Node FILE-4: UNCLEAN (offline)
>   * Online: [ FILE-2 ]
>   * Online: [ FILE-3 ]
>
> At this stage FILE-1 and FILE-4 were continuously getting fenced (we
> have device-based stonith configured but the resource was not up).
> Two nodes were online and two were offline, so quorum wasn't attained
> again.
> 1) For such a scenario we need help to be able to keep the cluster alive.
> 2) And in cases where only one node of the cluster is up and the others
> are down, we need the resources and cluster to be up.
>
> Thanks
> Priyanka
>
> On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov wrote:
>
>> On 26.06.2023 21:14, Priyanka Balotra wrote:
>>> Hi All,
>>> We are seeing an issue where we replaced no-quorum-policy=ignore with
>>> other options in corosync.conf in order to simulate the same behaviour:
>>>
>>>     wait_for_all: 0
>>>     last_man_standing: 1
>>>     last_man_standing_window: 2
>>>
>>> There was another property (auto-tie-breaker) we tried, but we
>>> couldn't configure it as crm did not recognise this property.
>>>
>>> But even after using these options, we are seeing that the system is
>>> not quorate if at least half of the nodes are not up.
>>>
>>> Some properties from the crm config are as follows:
>>>
>>>     primitive stonith-sbd stonith:external/sbd \
>>>         params pcmk_delay_base=5s
>>>     ...
>>>     property cib-bootstrap-options: \
>>>         have-watchdog=true \
>>>         dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
>>>         cluster-infrastructure=corosync \
>>>         cluster-name=FILE \
>>>         stonith-enabled=true \
>>>         stonith-timeout=172 \
>>>         stonith-action=reboot \
>>>         stop-all-resources=false \
>>>         no-quorum-policy=ignore
>>>     rsc_defaults build-resource-defaults: \
>>>         resource-stickiness=1
>>>     rsc_defaults rsc-options: \
>>>         resource-stickiness=100 \
>>>         migration-threshold=3 \
>>>         failure-timeout=1m \
>>>         cluster-recheck-interval=10min
>>>     op_defaults op-options: \
>>>         timeout=600 \
>>>         record-pending=true
>>>
>>> On a 4-node setup when the whole cluster is brought up together we
>>> see error logs like:
>>>
>>>     2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
>>>     warning: Fencing and resource management disabled due to lack of quorum
>>>     2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
>>>     warning: Ignoring malformed node_state entry without uname
>>>     2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
>>>     warning: Node FILE-2 is unclean!
>>>     2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
>>>     warning: Node FILE-3 is unclean!
>>>     2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
>>>     warning: Node FILE-4 is unclean!

According to this output FILE-1 lost connection to the three other
nodes, in which case it cannot be quorate.

>>> Kindly help correct the configuration to make the system function
>>> normally with all resources up, even if there is just one node up.
>>>
>>> Please let me know if any more info is needed.
>>>
>>> Thanks
>>> Priyanka

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
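Andrei's point follows from votequorum's quorum arithmetic: quorum requires a strict majority of expected_votes. A small illustrative sketch of that documented rule (not corosync's actual code):

```python
# Votequorum-style quorum arithmetic: a partition is quorate only if it
# holds a strict majority of expected_votes.

def votes_needed(expected_votes: int) -> int:
    """Votes required for quorum: floor(expected_votes / 2) + 1."""
    return expected_votes // 2 + 1

# Cold-starting one node of a 4-node cluster: expected_votes is still 4,
# so 1 visible vote < 3 needed -> "disabled due to lack of quorum".
assert votes_needed(4) == 3

# last_man_standing only helps when nodes are lost gradually, because it
# lowers expected_votes one last_man_standing_window at a time while the
# remaining partition is still quorate:
# 4 nodes (need 3) -> 3 (need 2) -> 2 (need 2) -> 1 (need 1).
assert [votes_needed(n) for n in (4, 3, 2, 1)] == [3, 2, 2, 1]
```

This also shows why losing two of four nodes at once leaves the survivors inquorate: two votes is not a strict majority of four.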
Re: [ClusterLabs] no-quorum-policy=ignore is (Deprecated) and replaced with other options but not an effective solution

On Tue, 2023-06-27 at 09:51 +0530, Priyanka Balotra wrote:
> At this stage FILE-1 and FILE-4 were continuously getting fenced (we
> have device-based stonith configured but the resource was not up).
> Two nodes were online and two were offline, so quorum wasn't attained
> again.
> 1) For such a scenario we need help to be able to keep the cluster alive.
> 2) And in cases where only one node of the cluster is up and the others
> are down, we need the resources and cluster to be up.

The solution is to fix the fencing. Without fencing, there is no way to
know that the other nodes are *actually* offline. It's possible that
communication between the nodes has been temporarily interrupted, in
which case recovering resources could lead to a "split-brain" situation
that could corrupt data or make services unusable.

Onboard IPMI is not a production fencing mechanism by itself, because it
loses power when the node loses power. It's fine to use in a topology
with a fallback method, such as power fencing or watchdog-based SBD.
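For the watchdog-based SBD fallback mentioned above, the moving parts on SLES are roughly the following. This is a hedged sketch, not a tested setup: the device path is hypothetical, and the variable names should be verified against sbd(8) and the SLES HA documentation for the installed release.

```
# /etc/sysconfig/sbd -- shared poison-pill device plus hardware watchdog
SBD_DEVICE="/dev/disk/by-id/example-shared-lun"   # hypothetical path
SBD_WATCHDOG_DEV="/dev/watchdog"
SBD_STARTMODE="always"
```

On the Pacemaker side this pairs with the primitive already shown in the thread (primitive stonith-sbd stonith:external/sbd with params pcmk_delay_base=5s) and stonith-enabled=true. With a working watchdog, a node that loses the shared device can still self-fence, which addresses the "everybody lost connection to the shared storage" scenario guessed at earlier in the thread.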