On Tue, 2023-06-27 at 09:51 +0530, Priyanka Balotra wrote:
> Hi Andrei,
> After this state the system went through some more fencings and we
> saw the following state:
>
> :~ # crm status
> Cluster Summary:
>   * Stack: corosync
>   * Current DC: FILE-2 (version 2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36) - partition with quorum
>   * Last updated: Mon Jun 26 12:44:15 2023
>   * Last change: Mon Jun 26 12:41:12 2023 by root via cibadmin on FILE-2
>   * 4 nodes configured
>   * 11 resource instances configured
>
> Node List:
>   * Node FILE-1: UNCLEAN (offline)
>   * Node FILE-4: UNCLEAN (offline)
>   * Online: [ FILE-2 ]
>   * Online: [ FILE-3 ]
>
> At this stage FILE-1 and FILE-4 were continuously getting fenced (we
> have device-based stonith configured, but the resource was not up).
> Two nodes were online and two were offline, so quorum wasn't attained
> again.
> 1) For such a scenario we need help to be able to have one cluster
>    live.
> 2) And in cases where only one node of the cluster is up and others
>    are down, we need the resources and cluster to be up.
The solution is to fix the fencing. Without fencing, there is no way
to know that the other nodes are *actually* offline. It's possible
that communication between the nodes has been temporarily interrupted,
in which case recovering resources could lead to a "split-brain"
situation that could corrupt data or make services unusable.

Onboard IPMI is not a production fencing mechanism by itself, because
it loses power when the node loses power. It's fine to use in a
topology with a fallback method such as power fencing or
watchdog-based SBD. (A minimal, illustrative sketch of a watchdog-based
SBD fallback is included after the quoted thread below.)

> Thanks
> Priyanka
>
> On Tue, Jun 27, 2023 at 12:25 AM Andrei Borzenkov <arvidj...@gmail.com> wrote:
> > On 26.06.2023 21:14, Priyanka Balotra wrote:
> > > Hi All,
> > > We are seeing an issue where we replaced no-quorum-policy=ignore
> > > with other options in corosync.conf in order to simulate the same
> > > behaviour:
> > >
> > >     wait_for_all: 0
> > >     last_man_standing: 1
> > >     last_man_standing_window: 20000
> > >
> > > There was another property (auto-tie-breaker) tried, but we couldn't
> > > configure it as crm did not recognise this property.
> > >
> > > But even after using these options, we are seeing that the system is
> > > not quorate if at least half of the nodes are not up.
> > >
> > > Some properties from the crm config are as follows:
> > >
> > > primitive stonith-sbd stonith:external/sbd \
> > >     params pcmk_delay_base=5s
> > > ...
> > > property cib-bootstrap-options: \
> > >     have-watchdog=true \
> > >     dc-version="2.1.2+20211124.ada5c3b36-150400.2.43-2.1.2+20211124.ada5c3b36" \
> > >     cluster-infrastructure=corosync \
> > >     cluster-name=FILE \
> > >     stonith-enabled=true \
> > >     stonith-timeout=172 \
> > >     stonith-action=reboot \
> > >     stop-all-resources=false \
> > >     no-quorum-policy=ignore
> > > rsc_defaults build-resource-defaults: \
> > >     resource-stickiness=1
> > > rsc_defaults rsc-options: \
> > >     resource-stickiness=100 \
> > >     migration-threshold=3 \
> > >     failure-timeout=1m \
> > >     cluster-recheck-interval=10min
> > > op_defaults op-options: \
> > >     timeout=600 \
> > >     record-pending=true
> > >
> > > On a 4-node setup, when the whole cluster is brought up together, we
> > > see error logs like:
> > >
> > > 2023-06-26T11:35:17.231104+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > > warning: Fencing and resource management disabled due to lack of quorum
> > > 2023-06-26T11:35:17.231338+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > > warning: Ignoring malformed node_state entry without uname
> > > 2023-06-26T11:35:17.233771+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > > warning: Node FILE-2 is unclean!
> > > 2023-06-26T11:35:17.233857+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > > warning: Node FILE-3 is unclean!
> > > 2023-06-26T11:35:17.233957+00:00 FILE-1 pacemaker-schedulerd[26359]:
> > > warning: Node FILE-4 is unclean!
> >
> > According to this output FILE-1 lost connection to three other nodes,
> > in which case it cannot be quorate.
> >
> > > Kindly help correct the configuration to make the system function
> > > normally with all resources up, even if there is just one node up.
> > >
> > > Please let me know if any more info is needed.
> > >
> > > Thanks
> > > Priyanka
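For illustration only, here is a minimal sketch of the kind of
watchdog-based SBD fallback described above, in SUSE-style
/etc/sysconfig/sbd form. The shared-disk path is a placeholder and the
timeout values are examples, not recommendations; check sbd(8) and your
platform documentation before using anything like this:

    # /etc/sysconfig/sbd (example values; the disk path is a placeholder)
    SBD_DEVICE="/dev/disk/by-id/example-shared-lun"   # leave empty for diskless, watchdog-only SBD
    SBD_WATCHDOG_DEV="/dev/watchdog"
    SBD_WATCHDOG_TIMEOUT="5"
    SBD_PACEMAKER="yes"

    # One-time initialization of the shared disk (disk-based SBD only):
    #   sbd -d /dev/disk/by-id/example-shared-lun create

    # On the Pacemaker side, the existing stonith-sbd primitive covers the
    # disk-based case; diskless SBD additionally needs a watchdog timeout
    # set in the cluster, e.g.:
    #   crm configure property stonith-watchdog-timeout=10s

Note also that wait_for_all, last_man_standing, last_man_standing_window
and auto_tie_breaker are corosync votequorum settings that belong in the
quorum section of corosync.conf, not CIB properties, which is likely why
crm did not recognise auto-tie-breaker; see votequorum(5) for how they
interact.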
-- 
Ken Gaillot <kgail...@redhat.com>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/