Re: [ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)
On Thu, Mar 24, 2022 at 4:12 PM Ken Gaillot wrote:
> On Wed, 2022-03-23 at 05:30 +0000, Balotra, Priyanka wrote:
> > Hi All,
> >
> > We have a scenario on a SLES 12 SP3 cluster.
> > The scenario is explained as follows, in the order of events:
> > There is a 2-node cluster (FILE-1, FILE-2).
> > The cluster and the resources were up and running fine initially.
> > Then a fencing request from pacemaker got issued on both nodes
> > simultaneously.
> >
> > Logs from 1st node:
> > 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to receive the leave message. failed: 2
> > .
> > .
> > 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice: Requesting that FILE-1 perform 'off' action targeting FILE-2
> >
> > Logs from 2nd node:
> > 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to receive the leave message. failed: 1
> > .
> > .
> > Feb 22 03:26:38 FILE-2 pacemaker-fenced[5015] (call_remote_stonith) notice: Requesting that FILE-2 perform 'off' action targeting FILE-1
> >
> > When the nodes came up after unfencing, the DC got set after election.
> > After that, the resources which were expected to run on only one node
> > became active on both (all) nodes of the cluster.
> >
> > 27290 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
> > 27291 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27292 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
> > 27293 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27294 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource IP_Floating is active on 2 nodes (attempting recovery)
> > 27295 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27296 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
> > 27297 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27298 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
> > 27299 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27300 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
> > 27301 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> > 27302 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)
> >
> > Can you guys please help us understand if this is indeed a split-brain
> > scenario? Under what circumstances can such a scenario be observed?
>
> This does look like a split-brain, and the most likely cause is that
> the fence agent reported that fencing was successful, but it actually
> wasn't.
>
> What are you using as a fencing device?
>
> If you're using watchdog-based SBD, that won't work with only two
> nodes, because both nodes will assume they still have quorum and not
> self-fence. You need either true quorum or a shared external drive to
> use SBD.

We see a fencing resource stonith-sbd, so I would guess poison-pill fencing is configured. We should then also verify that stonith-watchdog-timeout isn't configured to anything but 0, just to be sure it would never fall back to watchdog fencing.

Maybe you can try inserting the poison pill manually and see if the targeted node reboots. You can do that either with high-level tooling such as crmsh or pcs, or with the sbd binary directly as a command-line tool. Try that both from the node to be rebooted and from the other node, to e.g. check that both sides see the same disk(s).

Check that the disk(s) configured with the sbd service are the same as those configured for the sbd fencing resource (and of course, when using sbd as a command-line tool to insert a poison pill, the same disks have to be used as well). Is the sbd service running without complaints? Please check as
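The manual poison-pill test suggested above can be sketched with the sbd command-line tool. This is only a sketch for a live cluster: the device path is a placeholder for your actual SBD disk, and FILE-1 is the example target node.

```shell
# Check whether stonith-watchdog-timeout is set (should be 0/unset for
# pure poison-pill fencing)
crm_attribute --type crm_config --name stonith-watchdog-timeout --query

# Inspect which disk(s) the sbd service is configured with
# (default config location on SLES)
grep '^SBD_DEVICE' /etc/sysconfig/sbd

# Dump the SBD metadata and per-node message slots on the shared disk;
# run this on both nodes to confirm they see the same disk
sbd -d /dev/disk/by-id/<your-sbd-disk> dump
sbd -d /dev/disk/by-id/<your-sbd-disk> list

# Insert a poison pill for FILE-1 directly; if disk-based fencing
# works, FILE-1 should go down shortly afterwards
sbd -d /dev/disk/by-id/<your-sbd-disk> message FILE-1 off

# Equivalent high-level tests via crmsh or pcs
crm node fence FILE-1
pcs stonith fence FILE-1
```

These commands act on real cluster hardware and a shared disk, so only run the `message`/`fence` steps against a node you are prepared to have powered off.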
Re: [ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)
On Wed, 2022-03-23 at 05:30 +0000, Balotra, Priyanka wrote:
> Hi All,
>
> We have a scenario on a SLES 12 SP3 cluster.
> The scenario is explained as follows, in the order of events:
> There is a 2-node cluster (FILE-1, FILE-2).
> The cluster and the resources were up and running fine initially.
> Then a fencing request from pacemaker got issued on both nodes
> simultaneously.
>
> Logs from 1st node:
> 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to receive the leave message. failed: 2
> .
> .
> 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice: Requesting that FILE-1 perform 'off' action targeting FILE-2
>
> Logs from 2nd node:
> 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to receive the leave message. failed: 1
> .
> .
> Feb 22 03:26:38 FILE-2 pacemaker-fenced[5015] (call_remote_stonith) notice: Requesting that FILE-2 perform 'off' action targeting FILE-1
>
> When the nodes came up after unfencing, the DC got set after election.
> After that, the resources which were expected to run on only one node
> became active on both (all) nodes of the cluster.
>
> 27290 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
> 27291 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27292 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
> 27293 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27294 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource IP_Floating is active on 2 nodes (attempting recovery)
> 27295 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27296 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
> 27297 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27298 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
> 27299 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27300 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
> 27301 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27302 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)
>
> Can you guys please help us understand if this is indeed a split-brain
> scenario? Under what circumstances can such a scenario be observed?

This does look like a split-brain, and the most likely cause is that the fence agent reported that fencing was successful, but it actually wasn't.

What are you using as a fencing device?

If you're using watchdog-based SBD, that won't work with only two nodes, because both nodes will assume they still have quorum and not self-fence. You need either true quorum or a shared external drive to use SBD.

> We can have a very serious impact if such a case can re-occur in spite
> of stonith already being configured. Hence the ask.
> In case this situation gets reproduced, how can it be handled?
>
> Note: We have stonith configured and it has been working fine so far.
> In this case also, the initial fencing happened from stonith only.
>
> Thanks in advance!

--
Ken Gaillot

___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
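For reference, a disk-backed SBD setup (the variant that does work with only two nodes, as opposed to watchdog-only SBD) combines a shared-disk entry in /etc/sysconfig/sbd with corosync's two-node vote handling. All values below are illustrative, not taken from the poster's cluster:

```shell
# /etc/sysconfig/sbd -- poison-pill SBD needs a shared disk that
# both nodes can see; the device path is a placeholder
SBD_DEVICE="/dev/disk/by-id/<shared-sbd-disk>"
SBD_WATCHDOG_DEV="/dev/watchdog"

# Typical corosync.conf quorum section for a 2-node cluster:
#
#   quorum {
#       provider: corosync_votequorum
#       two_node: 1
#   }
#
# two_node lets each surviving node retain quorum when the other is
# lost -- which is exactly why, without a shared disk, both halves of
# a split 2-node cluster believe they are quorate and neither
# self-fences.
```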
Re: [ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)
On 23.03.2022 08:30, Balotra, Priyanka wrote:
> Hi All,
>
> We have a scenario on a SLES 12 SP3 cluster.
> The scenario is explained as follows, in the order of events:
>
> * There is a 2-node cluster (FILE-1, FILE-2)
> * The cluster and the resources were up and running fine initially.
> * Then a fencing request from pacemaker got issued on both nodes
>   simultaneously.
>
> Logs from 1st node:
> 2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to receive the leave message. failed: 2
> .
> .
> 2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice: Requesting that FILE-1 perform 'off' action targeting FILE-2
>
> Logs from 2nd node:
> 2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to receive the leave message. failed: 1
> .
> .
> Feb 22 03:26:38 FILE-2 pacemaker-fenced[5015] (call_remote_stonith) notice: Requesting that FILE-2 perform 'off' action targeting FILE-1

This is normal behavior in case of split brain. Each node will try to fence the other node so that it can take over its resources.

> * When the nodes came up after unfencing, the DC got set after election

What exactly does "came up" mean?

> * After that the resources which were expected to run on only one node
>   became active on both (all) nodes of the cluster.

It sounds like both nodes believed fencing had been successful, and so each node took over the other node's resources. It is impossible to tell more without seeing the actual logs from both nodes and the actual configuration.

> 27290 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
> 27291 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27292 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
> 27293 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27294 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource IP_Floating is active on 2 nodes (attempting recovery)
> 27295 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27296 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
> 27297 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27298 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
> 27299 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27300 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
> 27301 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
> 27302 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)
>
> Can you guys please help us understand if this is indeed a split-brain
> scenario?

I do not understand this question, and I suspect you are using "split brain" incorrectly. Split brain is the condition in which corosync/pacemaker on two nodes cannot communicate. A split brain ends with a fencing request.

> Under what circumstances can such a scenario be observed?

When two nodes are unable to communicate with each other, if "such a scenario" refers to "split brain".

> We can have a very serious impact if such a case can re-occur in spite
> of stonith already being configured. Hence the ask.
> In case this situation gets reproduced, how can it be handled?

A stonith agent must never return success unless it can confirm that fencing was successful.

> Note: We have stonith configured and it has been working fine so far.
> In this case also, the initial fencing happened from stonith only.
>
> Thanks in advance!
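One way to check the confirmation behavior end to end is to trigger a fencing action through pacemaker's fencer and watch whether the target actually goes down. The node name is an example, and this will really power off the target:

```shell
# Ask the fencer to fence FILE-1 and report the outcome verbosely
stonith_admin --fence FILE-1 --verbose

# Review the recorded fencing operations for that target afterwards
stonith_admin --history FILE-1
```

If the operation is reported as successful while FILE-1 keeps running, you have reproduced the failure mode discussed in this thread: a fence agent confirming a fencing action that never actually happened.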
[ClusterLabs] Resources too_active (active on all nodes of the cluster, instead of only 1 node)
Hi All,

We have a scenario on a SLES 12 SP3 cluster. The scenario is explained as follows, in the order of events:

- There is a 2-node cluster (FILE-1, FILE-2)
- The cluster and the resources were up and running fine initially.
- Then a fencing request from pacemaker got issued on both nodes simultaneously.

Logs from 1st node:
2022-02-22T03:26:36.737075+00:00 FILE-1 corosync[12304]: [TOTEM ] Failed to receive the leave message. failed: 2
.
.
2022-02-22T03:26:36.977888+00:00 FILE-1 pacemaker-fenced[12331]: notice: Requesting that FILE-1 perform 'off' action targeting FILE-2

Logs from 2nd node:
2022-02-22T03:26:36.738080+00:00 FILE-2 corosync[4989]: [TOTEM ] Failed to receive the leave message. failed: 1
.
.
Feb 22 03:26:38 FILE-2 pacemaker-fenced[5015] (call_remote_stonith) notice: Requesting that FILE-2 perform 'off' action targeting FILE-1

- When the nodes came up after unfencing, the DC got set after election.
- After that the resources which were expected to run on only one node became active on both (all) nodes of the cluster.

27290 2022-02-22T04:16:31.699186+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource stonith-sbd is active on 2 nodes (attempting recovery)
27291 2022-02-22T04:16:31.699397+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
27292 2022-02-22T04:16:31.699590+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource FILE_Filesystem is active on 2 nodes (attempting recovery)
27293 2022-02-22T04:16:31.699731+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
27294 2022-02-22T04:16:31.699878+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource IP_Floating is active on 2 nodes (attempting recovery)
27295 2022-02-22T04:16:31.700027+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
27296 2022-02-22T04:16:31.700203+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgresql is active on 2 nodes (attempting recovery)
27297 2022-02-22T04:16:31.700354+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
27298 2022-02-22T04:16:31.700501+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_Postgrest is active on 2 nodes (attempting recovery)
27299 2022-02-22T04:16:31.700648+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
27300 2022-02-22T04:16:31.700792+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Service_esm_primary is active on 2 nodes (attempting recovery)
27301 2022-02-22T04:16:31.700939+00:00 FILE-2 pacemaker-schedulerd[5018]: notice: See https://wiki.clusterlabs.org/wiki/FAQ#Resource_is_Too_Active for more information
27302 2022-02-22T04:16:31.701086+00:00 FILE-2 pacemaker-schedulerd[5018]: error: Resource Shared_Cluster_Backup is active on 2 nodes (attempting recovery)

Can you guys please help us understand if this is indeed a split-brain scenario? Under what circumstances can such a scenario be observed?

We can have a very serious impact if such a case can re-occur in spite of stonith already being configured. Hence the ask.
In case this situation gets reproduced, how can it be handled?

Note: We have stonith configured and it has been working fine so far. In this case also, the initial fencing happened from stonith only.

Thanks in advance!
Priyanka