On Thu, Mar 24, 2022 at 4:12 PM Ken Gaillot <[email protected]> wrote:
>
> On Wed, 2022-03-23 at 05:30 +0000, Balotra, Priyanka wrote:
> > Hi All,
> >
> > We have a scenario on a SLES 12 SP3 cluster.
> > The scenario is explained as follows, in the order of events:
> > There is a 2-node cluster (FILE-1, FILE-2).
> > The cluster and the resources were up and running fine initially.
> > Then a fencing request from Pacemaker got issued on both nodes
> > simultaneously.
> >
> > Logs from 1st node:
> > 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to receive the leave message. failed: 2
> > .
> > .
> > 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice: Requesting that FILE-1 perform 'off' action targeting FILE-2
> >
> > Logs from 2nd node:
> > 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to receive the leave message. failed: 1
> > .
> > .
> > Feb 22 03:26:38 FILE-2 pacemaker-fenced [5015] (call_remote_stonith) notice: Requesting that FILE-2 perform 'off' action targeting FILE-1
> >
> > When the nodes came up after unfencing, the DC got set after election.
> > After that, the resources which were expected to run on only one node
> > became active on both (all) nodes of the cluster.
> >
> > 27290 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
> > 27291 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27292 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
> > 27293 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27294 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource IP_Floating is active on 2 nodes (attempting recovery)
> > 27295 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27296 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
> > 27297 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27298 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
> > 27299 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27300 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
> > 27301 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27302 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)
> >
> > Can you guys please help us understand if this is indeed a split-brain
> > scenario? Under what circumstances can such a scenario be observed?
>
> This does look like a split-brain, and the most likely cause is that
> the fence agent reported that fencing was successful, but it actually
> wasn't.
>
> What are you using as a fencing device?
>
> If you're using watchdog-based SBD, that won't work with only two
> nodes, because both nodes will assume they still have quorum and not
> self-fence. You need either true quorum or a shared external drive to
> use SBD.
We see a fencing resource stonith-sbd, so I would guess poison-pill fencing is configured.
Then we should also verify that stonith-watchdog-timeout is not configured to anything but 0, just to be sure it would never fall back to watchdog fencing.

Maybe you can try inserting the poison pill manually and see if the targeted node reboots. You can do that either with high-level tooling such as crmsh or pcs, or with the sbd binary directly on the command line. Try it both from the node to be rebooted and from the other node, e.g. to check whether both sides see the same disk(s). (Rough command sketches for these checks are at the end of this mail.)

Check that the disk(s) configured with the sbd service are the same as those configured for the sbd fencing resource (and of course, when using sbd as a command-line tool to insert a poison pill, the same disks have to be used as well).

Is the sbd service running without complaints?

Please also check for a (hardware) watchdog properly configured with sbd. In this case I guess we should still have seen a reboot, even with a non-working watchdog, as both nodes seem to be alive enough. But it is still important that the watchdog works properly for cases where nodes aren't responsive anymore.

Klaus

>
> > We can have very serious impact if such a case can re-occur in spite
> > of stonith already configured. Hence the ask.
> > In case this situation gets reproduced, how can it be handled?
> >
> > Note: We have stonith configured and it has been working fine so far.
> > In this case also, the initial fencing happened from stonith only.
> >
> > Thanks in advance!
> --
> Ken Gaillot <[email protected]>
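
A rough sketch of the configuration checks mentioned above, assuming the usual SLES layout where the sbd service reads /etc/sysconfig/sbd and crmsh is available; the resource name stonith-sbd is taken from the logs above, and the device path is only a placeholder:

  # Which disk(s) and watchdog does the sbd service itself use?
  grep -E '^SBD_DEVICE|^SBD_WATCHDOG' /etc/sysconfig/sbd

  # Which disk(s) does the fencing resource use (if it sets a devices parameter)?
  # Compare this against SBD_DEVICE above - on both nodes.
  crm configure show stonith-sbd

  # Make sure the cluster can never fall back to watchdog-only fencing:
  # stonith-watchdog-timeout should be 0 or not set at all.
  crm_attribute --type crm_config --name stonith-watchdog-timeout --query

  # Is the sbd daemon running cleanly on both nodes?
  systemctl status sbd

  # Do both nodes see the same SBD header and slots on the shared disk?
  # (replace the device path with your actual SBD disk)
  sbd -d /dev/disk/by-id/YOUR-SBD-DISK dump
  sbd -d /dev/disk/by-id/YOUR-SBD-DISK list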
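
And a sketch of the manual poison-pill test, again with a placeholder device path. The 'test' message is only logged by the target's sbd daemon, while 'off' actually takes the node down, so run the latter only when you can afford the outage:

  # Non-disruptive check first: FILE-2's sbd daemon should log that it
  # received the test message if it is watching the same disk.
  sbd -d /dev/disk/by-id/YOUR-SBD-DISK message FILE-2 test

  # The real poison pill: FILE-2 should go down almost immediately.
  sbd -d /dev/disk/by-id/YOUR-SBD-DISK message FILE-2 off

  # Clear the slot again before bringing FILE-2 back into the cluster.
  sbd -d /dev/disk/by-id/YOUR-SBD-DISK message FILE-2 clear

  # The same test through the high-level tooling (crmsh shown here; pcs has
  # 'pcs stonith fence' as the equivalent).
  crm node fence FILE-2

  # Repeat in the other direction (from FILE-2 targeting FILE-1) so you
  # know both sides can reach and write the same disk.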
