On Wed, Jun 28, 2023 at 7:38 AM Klaus Wenninger <kwenn...@redhat.com> wrote:
> On Wed, Jun 28, 2023 at 3:30 AM Priyanka Balotra
> <priyanka.14balo...@gmail.com> wrote:
>
>> I am using SLES 15 SP4. Is the no-quorum-policy still supported?
>>
>> Thanks
>> Priyanka
>>
>> On Wed, 28 Jun 2023 at 12:46 AM, Ken Gaillot <kgail...@redhat.com> wrote:
>>
>>> On Tue, 2023-06-27 at 22:38 +0530, Priyanka Balotra wrote:
>>>> In this case stonith has been configured as a resource,
>>>>
>>>>     primitive stonith-sbd stonith:external/sbd
>>>>
>
> Then the error scenario you described looks like everybody lost
> connection to the shared storage. The nodes that rebooted then probably
> suicided rather than reading the poison pill. And the quorate partition
> is staying alive because it is quorate, but without seeing the shared
> storage it can't verify that it was able to write the poison pill, which
> leaves the other nodes unclean. But again, just guessing ...
>
> That said, and without knowing details about your scenario and the
> failure scenarios you want to cover, you might consider watchdog
> fencing. afaik SUSE has supported that for a while now as well. It gives
> you service recovery from nodes that are cut off via the network,
> including from their physical fencing devices. Poison-pill fencing
> should do that as well, as long as the quorate part of the cluster is
> able to access the shared disk, but in your scenario this doesn't seem
> to be the case.
>
> Just out of curiosity: are you using poison pill with multiple shared
> disks? Asking because in that case the poison pill may still be passed
> via a single disk and the target would reboot, but the side that
> initiated fencing might not recover resources, as it might not have been
> able to write the poison pill to a quorate number of disks.
>
> Klaus
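A minimal sketch of what Klaus is describing, for anyone following along:
watchdog-only ("diskless") SBD is typically set up by leaving SBD_DEVICE
unset, so the hardware watchdog alone does the fencing. The values below
are illustrative and not taken from the cluster in this thread:

    # /etc/sysconfig/sbd -- note: no SBD_DEVICE= line
    SBD_PACEMAKER=yes
    SBD_STARTMODE=always
    SBD_WATCHDOG_DEV=/dev/watchdog
    SBD_WATCHDOG_TIMEOUT=5

    # tell Pacemaker how long to wait before it may assume a lost node
    # has self-fenced (commonly about twice SBD_WATCHDOG_TIMEOUT)
    crm configure property stonith-watchdog-timeout=10

sbd.service then has to be enabled on every node and the cluster stack
restarted before this takes effect.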
>>>> For it to function properly, the resource needs to be up, which is
>>>> only possible if the system is quorate.
>>>
>>> Pacemaker can use a fence device even if its resource is not active.
>>> The resource being active just allows Pacemaker to monitor the device
>>> regularly.
>>>
>>>> Hence our requirement is to make the system quorate even if only one
>>>> node of the cluster is up.
>>>> Stonith will then take care of any split-brain scenarios.
>>>
>>> In that case it sounds like no-quorum-policy=ignore is actually what
>>> you want.
>>>
>
> Still dangerous without something like wait_for_all - right?
> With LMS I guess you should get the same effect even without it being
> explicitly specified, though.
>
> Klaus
>
>>>> Thanks
>>>> Priyanka
>>>>
>>>> On Tue, Jun 27, 2023 at 9:06 PM Klaus Wenninger
>>>> <kwenn...@redhat.com> wrote:
>>>>
>>>>> On Tue, Jun 27, 2023 at 5:24 PM Andrei Borzenkov
>>>>> <arvidj...@gmail.com> wrote:
>>>>>
>>>>>> On 27.06.2023 07:21, Priyanka Balotra wrote:
>>>>>>
>>>>>>> Hi Andrei,
>>>>>>> After this state the system went through some more fencings, and
>>>>>>> we saw the following state:
>>>>>>>
>>>>>>>     :~ # crm status
>>>>>>>     Cluster Summary:
>>>>>>>       * Stack: corosync
>>>>>>>       * Current DC: FILE-2 (version
>>>>>>>         2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36)
>>>>>>>         - partition with quorum
>>>>>>
>>>>>> It says "partition with quorum" so what exactly is the problem?
>>>>>
>>>>> I guess the problem is that resources aren't being recovered on the
>>>>> nodes in the quorate partition.
>>>>> The reason for that is probably that - as Ken was already suggesting -
>>>>> fencing isn't working properly, or the fencing devices used are
>>>>> simply inappropriate for the purpose (e.g. onboard IPMI).
>>>>> The fact that a node is rebooting isn't enough. The node that
>>>>> initiated fencing has to know that it did actually work. But we're
>>>>> just guessing here. Logs should show what is actually going on.
>>>>>
>>>>> Klaus
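Since the thread keeps coming back to whether fencing was actually
confirmed, these are the commands usually used to check that; the disk
path below is a placeholder, not the real device:

    # fencing history as Pacemaker recorded it, for all nodes
    stonith_admin --history '*'

    # or include fence history in the status output
    crm_mon --one-shot --fence-history=3

    # for poison-pill SBD: verify every node has a slot and the disk is readable
    sbd -d /dev/disk/by-id/<shared-disk> list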
>>>>>>>       * Last updated: Mon Jun 26 12:44:15 2023
>>>>>>>       * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
>>>>>>>       * 4 nodes configured
>>>>>>>       * 11 resource instances configured
>>>>>>>
>>>>>>>     Node List:
>>>>>>>       * Node FILE-1: UNCLEAN (offline)
>>>>>>>       * Node FILE-4: UNCLEAN (offline)
>>>>>>>       * Online: [ FILE-2 ]
>>>>>>>       * Online: [ FILE-3 ]
>>>>>>>
>>>>>>> At this stage FILE-1 and FILE-4 were continuously getting fenced
>>>>>>> (we have device-based stonith configured, but the resource was not
>>>>>>> up). Two nodes were online and two were offline, so quorum wasn't
>>>>>>> attained again.
>>>>>>> 1) For such a scenario we need help to be able to have one cluster
>>>>>>> live.
>>>>>>> 2) And in cases where only one node of the cluster is up and the
>>>>>>> others are down, we need the resources and the cluster to be up.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Priyanka
>>>>>>>
>>>>>>> On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov
>>>>>>> <arvidj...@gmail.com> wrote:
>>>>>>>
>>>>>>>> On 26.06.2023 21:14, Priyanka Balotra wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>> We are seeing an issue where we replaced no-quorum-policy=ignore
>>>>>>>>> with other options in corosync.conf in order to simulate the
>>>>>>>>> same behaviour:
>>>>>>>>>
>>>>>>>>>     wait_for_all: 0
>>>>>>>>>     last_man_standing: 1
>>>>>>>>>     last_man_standing_window: 20000
>>>>>>>>>
>>>>>>>>> Another property (auto-tie-breaker) was tried, but it couldn't
>>>>>>>>> be configured as crm did not recognise this property.
>>>>>>>>>
>>>>>>>>> But even after using these options, we are seeing that the
>>>>>>>>> system is not quorate if at least half of the nodes are not up.
>>>>>>>>>
>>>>>>>>> Some properties from the crm config are as follows:
>>>>>>>>>
>>>>>>>>>     primitive stonith-sbd stonith:external/sbd \
>>>>>>>>>         params pcmk_delay_base=5s
>>>>>>>>>
>>>>>>>>>     ...
>>>>>>>>>
>>>>>>>>>     property cib-bootstrap-options: \
>>>>>>>>>         have-watchdog=true \
>>>>>>>>>         dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
>>>>>>>>>         cluster-infrastructure=corosync \
>>>>>>>>>         cluster-name=FILE \
>>>>>>>>>         stonith-enabled=true \
>>>>>>>>>         stonith-timeout=172 \
>>>>>>>>>         stonith-action=reboot \
>>>>>>>>>         stop-all-resources=false \
>>>>>>>>>         no-quorum-policy=ignore
>>>>>>>>>     rsc_defaults build-resource-defaults: \
>>>>>>>>>         resource-stickiness=1
>>>>>>>>>     rsc_defaults rsc-options: \
>>>>>>>>>         resource-stickiness=100 \
>>>>>>>>>         migration-threshold=3 \
>>>>>>>>>         failure-timeout=1m \
>>>>>>>>>         cluster-recheck-interval=10min
>>>>>>>>>     op_defaults op-options: \
>>>>>>>>>         timeout=600 \
>>>>>>>>>         record-pending=true
>>>>>>>>>
>>>>>>>>> On a 4-node setup, when the whole cluster is brought up together
>>>>>>>>> we see error logs like:
>>>>>>>>>
>>>>>>>>>     2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Fencing and resource management disabled due to lack of quorum
>>>>>>>>>     2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Ignoring malformed node_state entry without uname
>>>>>>>>>     2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-2 is unclean!
>>>>>>>>>     2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-3 is unclean!
>>>>>>>>>     2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]: warning: Node FILE-4 is unclean!
>>>>>>>>
>>>>>>>> According to this output FILE-1 lost connection to the three
>>>>>>>> other nodes, in which case it cannot be quorate.
>>>>>>>>
>>>>>>>>> Kindly help correct the configuration to make the system
>>>>>>>>> function normally, with all resources up, even if there is just
>>>>>>>>> one node up.
>>>>>>>>>
>>>>>>>>> Please let me know if any more info is needed.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Priyanka
>>>
>>> --
>>> Ken Gaillot <kgail...@redhat.com>
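To put the corosync options from the original post in context, the
votequorum section being discussed would look roughly like the sketch
below. expected_votes is only assumed from the 4-node setup, and
wait_for_all is shown re-enabled in line with Klaus's caution:

    # corosync.conf, quorum section -- illustrative only
    quorum {
        provider: corosync_votequorum
        expected_votes: 4                 # assumed: one vote per node, 4 nodes
        wait_for_all: 1                   # dropping this is what Klaus calls dangerous
        last_man_standing: 1
        last_man_standing_window: 20000   # milliseconds
    }

    # afterwards, check how votequorum sees the membership
    corosync-quorumtool -s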
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/