Re: [ClusterLabs] resource start after network reconnected
You are right, but when the SBD disk has failed, I usually focus on recovering it as soon as possible. Once the disk is recovered and the watcher detects it again, shutting down is possible. And of course disk-based SBD is better than nothing.

Best Regards,
Strahil Nikolov

On Sun, Nov 21, 2021 at 8:47, Andrei Borzenkov wrote:
> On 21.11.2021 00:39, Strahil Nikolov via Users wrote:
>> Nope, as long as you use SBD's integration with pacemaker. As the 2 nodes
>> can communicate with each other, sbd won't act. I think it was an entry
>> like this in /etc/sysconfig/sbd: 'SBD_PACEMAKER=yes'
>>
>
> That's correct, except that it is impossible to stop pacemaker on one node
> under this condition because the remaining node will immediately commit
> suicide. It is not even possible to perform a normal cluster shutdown.
>
> I wish SBD supported a "deactivate" message to stop pretending that it
> knows better than the administrator or, even better, understood that
> pacemaker is stopping intentionally. Currently there is no way around it
> (short of pkill -9 sbd) because the systemd unit refuses a manual SBD stop.
>
>> On Sat, Nov 20, 2021 at 23:24, Valentin Vidić via Users wrote:
>> On Sat, Nov 20, 2021 at 08:33:26PM +, Strahil Nikolov via Users wrote:
>>> You can also use this 3rd node to provide iSCSI and then the SBD will
>>> be disk-full :D . The good thing about this type of setup is that you
>>> won't need to put location constraints for the 3rd node.
>>
>> Wouldn't that make the iSCSI node a SPOF? If the iSCSI goes down, SBD
>> resets both cluster nodes.
Re: [ClusterLabs] resource start after network reconnected
On 21.11.2021 00:39, Strahil Nikolov via Users wrote:
> Nope, as long as you use SBD's integration with pacemaker. As the 2 nodes
> can communicate with each other, sbd won't act. I think it was an entry
> like this in /etc/sysconfig/sbd: 'SBD_PACEMAKER=yes'
>

That's correct, except that it is impossible to stop pacemaker on one node under this condition because the remaining node will immediately commit suicide. It is not even possible to perform a normal cluster shutdown.

I wish SBD supported a "deactivate" message to stop pretending that it knows better than the administrator or, even better, understood that pacemaker is stopping intentionally. Currently there is no way around it (short of pkill -9 sbd) because the systemd unit refuses a manual SBD stop.

> On Sat, Nov 20, 2021 at 23:24, Valentin Vidić via Users wrote:
> On Sat, Nov 20, 2021 at 08:33:26PM +, Strahil Nikolov via Users wrote:
>> You can also use this 3rd node to provide iSCSI and then the SBD will
>> be disk-full :D . The good thing about this type of setup is that you
>> won't need to put location constraints for the 3rd node.
>
> Wouldn't that make the iSCSI node a SPOF? If the iSCSI goes down, SBD
> resets both cluster nodes.
Re: [ClusterLabs] resource start after network reconnected
Nope, as long as you use SBD's integration with pacemaker. As the 2 nodes can communicate with each other, sbd won't act. I think it was an entry like this in /etc/sysconfig/sbd: 'SBD_PACEMAKER=yes'

On Sat, Nov 20, 2021 at 23:24, Valentin Vidić via Users wrote:
> On Sat, Nov 20, 2021 at 08:33:26PM +, Strahil Nikolov via Users wrote:
>> You can also use this 3rd node to provide iSCSI and then the SBD will
>> be disk-full :D . The good thing about this type of setup is that you
>> won't need to put location constraints for the 3rd node.
>
> Wouldn't that make the iSCSI node a SPOF? If the iSCSI goes down, SBD
> resets both cluster nodes.
>
> --
> Valentin
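For anyone wanting to confirm the setting mentioned in the message above, here is a minimal sketch. It assumes the file is /etc/sysconfig/sbd and that the entry reads SBD_PACEMAKER=yes, exactly as recalled in the thread; the function name `sbd_pacemaker_enabled` and the `SBD_CONFIG` variable are made up for illustration, and only the literal value "yes" is checked.

```shell
#!/bin/sh
# Sketch: report whether sbd's Pacemaker integration appears to be enabled.
# Assumption: the setting is a line SBD_PACEMAKER=yes (optionally quoted)
# in the sbd sysconfig file, as described in the thread.

sbd_pacemaker_enabled() {
    # $1: path to the sbd sysconfig file
    grep -Eq '^[[:space:]]*SBD_PACEMAKER="?yes"?[[:space:]]*$' "$1"
}

SBD_CONFIG="${SBD_CONFIG:-/etc/sysconfig/sbd}"   # hypothetical override variable
if [ -r "$SBD_CONFIG" ] && sbd_pacemaker_enabled "$SBD_CONFIG"; then
    echo "SBD_PACEMAKER=yes is set in $SBD_CONFIG"
else
    echo "SBD_PACEMAKER=yes not found in $SBD_CONFIG"
fi
```

The check is read-only, so it can be run on a live node without touching sbd itself.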
Re: [ClusterLabs] resource start after network reconnected
On Sat, Nov 20, 2021 at 08:33:26PM +, Strahil Nikolov via Users wrote:
> You can also use this 3rd node to provide iSCSI and then the SBD will
> be disk-full :D . The good thing about this type of setup is that you
> do won't need to put location constraints for the 3rd node.

Wouldn't that make the iSCSI node a SPOF? If the iSCSI goes down, SBD
resets both cluster nodes.

--
Valentin
Re: [ClusterLabs] resource start after network reconnected
You can also use this 3rd node to provide iSCSI and then the SBD will be disk-full :D . The good thing about this type of setup is that you won't need to put location constraints for the 3rd node.

Also, check the ping resource: you can set it up to "kick out" all resources when a ping to a specific IP (for example the gateway) fails. Once the network is restored, the node automatically becomes eligible to host the resources again.

Also consider more advanced resource agents like ocf:heartbeat:mysql to control your mysql/mariadb database and also replication between a primary and a secondary (a.k.a. master-slave replication).

Best Regards,
Strahil Nikolov

On Friday, 19 November 2021, 21:46:22 GMT+2, john tillman wrote:
>> On Fri, Nov 19, 2021 at 11:26:01AM -0500, john tillman wrote:
>>> Anyone have any other ideas for a configuration setting that will
>>> effectively do whatever 'pcs resource refresh' is doing when quorum is
>>> restored?
>>
>> Since you have three nodes you may want to use the third node as QDevice
>> instead:
>>
>> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-qdevice.html
>>
>> After that SBD can be configured in diskless mode to reset the node that
>> loses quorum:
>>
>> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html#sec-ha-storage-protect-diskless-sbd
>>
>
> Thank you. I'll look into using the Qdevice in the next release. For now,
> I just have the three nodes with "vanilla" cluster packages.
>
>> --
>> Valentin
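The ping-resource suggestion above could look roughly like the following sketch. Everything specific here is an assumption for illustration: the gateway address `192.0.2.1`, the resource name `ping-gw`, the helper names `run` and `setup_ping_constraint`, and the choice of `spmVIP` as the resource to move. By default the script only prints the `pcs` commands (DRY_RUN=1) so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: a cloned ping resource plus a location rule that keeps a resource
# off any node that cannot reach the gateway.
# Assumptions: GATEWAY_IP, the name ping-gw and the use of spmVIP are
# illustrative only. DRY_RUN=1 (the default) prints instead of executing.

GATEWAY_IP="${GATEWAY_IP:-192.0.2.1}"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

setup_ping_constraint() {
    # ocf:pacemaker:ping records its result in the node attribute "pingd".
    run pcs resource create ping-gw ocf:pacemaker:ping \
        host_list="$GATEWAY_IP" dampen=5s multiplier=1000 \
        op monitor interval=10s clone
    # Ban spmVIP from nodes where pingd is missing or below 1.
    run pcs constraint location spmVIP rule score=-INFINITY \
        pingd lt 1 or not_defined pingd
}

setup_ping_constraint
```

Because the rule is evaluated against a node attribute, the ban lifts on its own once the ping succeeds again, which matches the "automatically becomes eligible" behaviour described above.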
Re: [ClusterLabs] resource start after network reconnected
On 19.11.2021 20:45, Ken Gaillot wrote:
> On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
>>> If pacemaker tries to stop resources due to out of quorum condition,
>>> you could set suitable failure-timeout; this will be equivalent to
>>> using "pcs resource refresh". Keep in mind that pacemaker only checks
>>> for failure-timeout expiration every cluster-recheck-interval (15
>
> That's true only for Pacemaker versions less than 2.0.3; since 2.0.3,
> the cluster rechecks as soon as the timeout hits.
>

Indeed. Thank you!
Re: [ClusterLabs] resource start after network reconnected
On 19.11.2021 19:26, john tillman wrote:
...
>>>
>>> If pacemaker tries to stop resources due to out of quorum condition, you
>>> could set suitable failure-timeout; this will be equivalent to using
>>> "pcs resource refresh". Keep in mind that pacemaker only checks for
>>> failure-timeout expiration every cluster-recheck-interval (15 minutes by
>>> default). This still is not directly related to network availability,
>>> but if network outage resulted in node going out of quorum, when network
>>> is back and node joined cluster again it will allow resources to be
>>> started on node.
>>>
>>
>> When quorum is lost I want all the resources to stop. The cluster is
>> performing this step correctly for me.
>>
>> That cluster-recheck-interval would explain the intermittence I saw this
>> morning. If I set that to 1 minute would that cause any gross negative
>> issues?
>>
>
> I tried setting cluster-recheck-interval to 1 minute and I saw no change
> to the resources after reconnecting the network. They were still listed
> as However, "pcs resource refresh" started it, as usual in this scenario.
>
> Anyone have any other ideas for a configuration setting that will
> effectively do whatever 'pcs resource refresh' is doing when quorum is
> restored?
>

I already told you above and it most certainly works here.

Without failure-timeout resource is stuck in blocked state:

Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:48:48 2021
  * Last change: Sat Nov 20 10:46:55 2021 by root via cibadmin on ha1
  * 3 nodes configured
  * 3 resource instances configured (1 BLOCKED from further action due to failure)

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * rsc_Test (ocf::_local:Dummy): FAILED ha1 (blocked)
    * Started: [ ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=100:
      * (10) start
      * (11) monitor: interval="1ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=100 fail-count=100 last-failure='Sat Nov 20 10:47:14 2021':
      * (18) start
      * (30) stop

Failed Resource Actions:
  * rsc_Test_stop_0 on ha1 'error' (1): call=30, status='complete', exitreason='forced to fail stop operation', last-rc-change='2021-11-20 10:47:14 +03:00', queued=0ms, exec=27ms

With failure-timeout resource is restarted after expiration.

Cluster Summary:
  * Stack: corosync
  * Current DC: ha1 (version 2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c) - partition with quorum
  * Last updated: Sat Nov 20 10:53:51 2021
  * Last change: Sat Nov 20 10:50:37 2021 by root via cibadmin on ha2
  * 3 nodes configured
  * 3 resource instances configured

Node List:
  * Online: [ ha1 ha2 qnetd ]

Full List of Resources:
  * Clone Set: cln_Test [rsc_Test]:
    * Started: [ ha1 ha2 ]
    * Stopped: [ qnetd ]

Operations:
  * Node: ha2:
    * rsc_Test: migration-threshold=100:
      * (18) probe
      * (18) probe
      * (19) monitor: interval="1ms"
  * Node: ha1:
    * rsc_Test: migration-threshold=100:
      * (40) probe
      * (40) probe
      * (41) monitor: interval="1ms"

Configuration:

node 1: ha1 \
        attributes pingd=1 \
        utilization cpu=20
node 2: ha2 \
        attributes pingd=1 \
        utilization cpu=20
node 3: qnetd
primitive rsc_Test ocf:_local:Dummy \
        meta failure-timeout=30s \
        op monitor interval=10s
clone cln_Test rsc_Test
location not_on_qnetd cln_Test -inf: qnetd
property cib-bootstrap-options: \
        cluster-infrastructure=corosync \
        cluster-name=ha \
        dc-version="2.1.0+20210816.c6a4f6e6c-1.1-2.1.0+20210816.c6a4f6e6c" \
        last-lrm-refresh=1637394576 \
        stonith-enabled=false \
        have-watchdog=true \
        stonith-watchdog-timeout=0 \
        placement-strategy=balanced
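To spot the blocked state shown in the first status listing without reading the whole output, a small filter along these lines may help. It is only a sketch: the function name `list_blocked` is invented here, and it assumes the status text uses the same "* name (agent): FAILED node (blocked)" layout as the listing above.

```shell
#!/bin/sh
# Sketch: list resources that the status output marks as "FAILED ... (blocked)".
# Assumption: each resource line looks like the status listing in this thread,
# i.e. "* <name> (<agent>): FAILED <node> (blocked)".

list_blocked() {
    # Reads status text on stdin and prints one blocked resource name per line.
    sed -n 's/^[[:space:]]*\*[[:space:]]*\([^[:space:]]*\)[[:space:]].*FAILED.*(blocked).*$/\1/p'
}

# Usage on a live cluster node (not executed here):
#   crm_mon -1 | list_blocked
```

An empty result means no resource is currently reported as blocked in that layout; it does not prove the cluster is healthy.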
Re: [ClusterLabs] resource start after network reconnected
On Fri, 2021-11-19 at 14:57 -0500, john tillman wrote:
> > On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
> > > > If pacemaker tries to stop resources due to out of quorum
> > > > condition, you could set suitable failure-timeout; this will be
> > > > equivalent to using "pcs resource refresh". Keep in mind that
> > > > pacemaker only checks for failure-timeout expiration every
> > > > cluster-recheck-interval (15
> >
> > That's true only for Pacemaker versions less than 2.0.3; since
> > 2.0.3, the cluster rechecks as soon as the timeout hits.
>
> I'm using pacemaker 2.0.5 and it is *not* starting MySQL when quorum
> is restored, at least not every time (~1 in 10). So I have seen it work

That's due to a stop failure, not the recheck interval.

> before but I'm more willing to believe that there was a user error in
> that one successful sample.
>
> We (actually a teammate) got mysql to start when quorum is restored. It
> required both setting the cluster-recheck-interval to something more
> frequent than 15min and setting the mysql resource's failure-timeout to
> non-zero. In our case we set both to 1 minute with good results for
> the last few tests. We can raise the frequency to something greater than
> 1 but for our tests, 1 proves it out.

The failure-timeout is equivalent to running refresh when the timeout hits. The cluster will then re-probe the status of the resource and decide what, if anything, needs to be done about it.

I can only see that working if the stop failure is transient -- i.e., either the stop actually succeeded but returned a failure code (or maybe timed out), and when the failure timeout or refresh happens, the re-probe sees the database is actually not running; or the stop really does fail, but by the time the failure timeout or refresh happens, another stop attempt after the re-probe is able to succeed.

> > > > minutes by default). This still is not directly related to
> > > > network availability, but if network outage resulted in node
> > > > going out of quorum, when network is back and node joined cluster
> > > > again it will allow resources to be started on node.
> > > >
> > >
> > > When quorum is lost I want all the resources to stop. The
> > > cluster is performing this step correctly for me.
> >
> > As long as it's working properly. If quorum is lost because one of
> > the nodes is malfunctioning -- maybe a device driver locked up the
> > system, or CPU wait is horrific due to an out-of-control process or
> > disk failure -- then that node will not know quorum has been lost and
> > will not stop resources. If the condition then clears up, suddenly you
> > have split-brain with two nodes running resources.
> >
> > > That cluster-recheck-interval would explain the intermittence I
> > > saw this morning. If I set that to 1 minute would that cause any
> > > gross negative issues?
> >
> > It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or
> > later, I definitely wouldn't bother. For older versions, 1 minute
> > feels a bit much, I would go with around 5.
> >
> > > Is there another setting besides cluster-recheck-interval to
> > > consider adjusting to start mysql when quorum is returned?
> > >
> > > Thank you for the feedback.
> > >
> > > -John
> >
> > --
> > Ken Gaillot

--
Ken Gaillot
Re: [ClusterLabs] resource start after network reconnected
> On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
>>> If pacemaker tries to stop resources due to out of quorum
>>> condition, you could set suitable failure-timeout; this will be
>>> equivalent to using "pcs resource refresh". Keep in mind that
>>> pacemaker only checks for failure-timeout expiration every
>>> cluster-recheck-interval (15
>
> That's true only for Pacemaker versions less than 2.0.3; since 2.0.3,
> the cluster rechecks as soon as the timeout hits.

I'm using pacemaker 2.0.5 and it is *not* starting MySQL when quorum is restored, at least not every time (~1 in 10). So I have seen it work before but I'm more willing to believe that there was a user error in that one successful sample.

We (actually a teammate) got mysql to start when quorum is restored. It required both setting the cluster-recheck-interval to something more frequent than 15min and setting the mysql resource's failure-timeout to non-zero. In our case we set both to 1 minute with good results for the last few tests. We can raise the frequency to something greater than 1 but for our tests, 1 proves it out.

>>> minutes by default). This still is not directly related to network
>>> availability, but if network outage resulted in node going out of
>>> quorum, when network is back and node joined cluster again it will
>>> allow resources to be started on node.
>>>
>>
>> When quorum is lost I want all the resources to stop. The cluster is
>> performing this step correctly for me.
>
> As long as it's working properly. If quorum is lost because one of the
> nodes is malfunctioning -- maybe a device driver locked up the system,
> or CPU wait is horrific due to an out-of-control process or disk
> failure -- then that node will not know quorum has been lost and will
> not stop resources. If the condition then clears up, suddenly you have
> split-brain with two nodes running resources.
>
>> That cluster-recheck-interval would explain the intermittence I saw
>> this morning. If I set that to 1 minute would that cause any gross
>> negative issues?
>
> It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or later, I
> definitely wouldn't bother. For older versions, 1 minute feels a bit
> much, I would go with around 5.
>
>> Is there another setting besides cluster-recheck-interval to consider
>> adjusting to start mysql when quorum is returned?
>>
>> Thank you for the feedback.
>>
>> -John
>
> --
> Ken Gaillot
Re: [ClusterLabs] resource start after network reconnected
> On Fri, Nov 19, 2021 at 11:26:01AM -0500, john tillman wrote:
>> Anyone have any other ideas for a configuration setting that will
>> effectively do whatever 'pcs resource refresh' is doing when quorum is
>> restored?
>
> Since you have three nodes you may want to use the third node as QDevice
> instead:
>
> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-qdevice.html
>
> After that SBD can be configured in diskless mode to reset the node that
> loses quorum:
>
> https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html#sec-ha-storage-protect-diskless-sbd
>

Thank you. I'll look into using the Qdevice in the next release. For now, I just have the three nodes with "vanilla" cluster packages.

> --
> Valentin
Re: [ClusterLabs] resource start after network reconnected
On Fri, Nov 19, 2021 at 11:26:01AM -0500, john tillman wrote:
> Anyone have any other ideas for a configuration setting that will
> effectively do whatever 'pcs resource refresh' is doing when quorum is
> restored?

Since you have three nodes you may want to use the third node as QDevice instead:

https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-qdevice.html

After that SBD can be configured in diskless mode to reset the node that loses quorum:

https://documentation.suse.com/sle-ha/15-SP1/html/SLE-HA-all/cha-ha-storage-protect.html#sec-ha-storage-protect-diskless-sbd

--
Valentin
Re: [ClusterLabs] resource start after network reconnected
On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
> > If pacemaker tries to stop resources due to out of quorum
> > condition, you could set suitable failure-timeout; this will be
> > equivalent to using "pcs resource refresh". Keep in mind that
> > pacemaker only checks for failure-timeout expiration every
> > cluster-recheck-interval (15

That's true only for Pacemaker versions less than 2.0.3; since 2.0.3, the cluster rechecks as soon as the timeout hits.

> > minutes by default). This still is not directly related to network
> > availability, but if network outage resulted in node going out of
> > quorum, when network is back and node joined cluster again it will
> > allow resources to be started on node.
> >
>
> When quorum is lost I want all the resources to stop. The cluster is
> performing this step correctly for me.

As long as it's working properly. If quorum is lost because one of the nodes is malfunctioning -- maybe a device driver locked up the system, or CPU wait is horrific due to an out-of-control process or disk failure -- then that node will not know quorum has been lost and will not stop resources. If the condition then clears up, suddenly you have split-brain with two nodes running resources.

> That cluster-recheck-interval would explain the intermittence I saw
> this morning. If I set that to 1 minute would that cause any gross
> negative issues?

It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or later, I definitely wouldn't bother. For older versions, 1 minute feels a bit much, I would go with around 5.

> Is there another setting besides cluster-recheck-interval to consider
> adjusting to start mysql when quorum is returned?
>
> Thank you for the feedback.
>
> -John

--
Ken Gaillot
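The advice in the message above might translate into something like this sketch: always set a failure-timeout on the resource, and shorten cluster-recheck-interval only on Pacemaker older than 2.0.3. The resource name `spmDB` comes from the thread; the 60s and 5min values, the `PACEMAKER_VERSION` variable and the helper names `run`, `version_lt` and `apply_failure_timeout` are assumptions for illustration. With DRY_RUN=1 (the default) the commands are only printed.

```shell
#!/bin/sh
# Sketch: set failure-timeout and, only for Pacemaker older than 2.0.3,
# shorten cluster-recheck-interval (around 5 minutes, per the message above).
# Assumptions: resource name spmDB, the 60s/5min values, and that the caller
# supplies PACEMAKER_VERSION. DRY_RUN=1 (the default) prints instead of running.

PACEMAKER_VERSION="${PACEMAKER_VERSION:-2.0.5}"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

version_lt() {
    # True when $1 is strictly older than $2 (relies on GNU sort -V).
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

apply_failure_timeout() {
    # Let recorded failures expire so the cluster re-probes the resource.
    run pcs resource meta spmDB failure-timeout=60s
    # Since 2.0.3 the recheck happens when the timeout hits, so skip tuning.
    if version_lt "$PACEMAKER_VERSION" 2.0.3; then
        run pcs property set cluster-recheck-interval=5min
    fi
}

apply_failure_timeout
```

Note that elsewhere in the thread a 2.0.5 cluster only recovered after both settings were changed, so the version gate here reflects the advice given rather than a verified rule.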
Re: [ClusterLabs] resource start after network reconnected
>> On 19.11.2021 17:36, john tillman wrote:
>>>> On 18.11.2021 22:33, john tillman wrote:
>>>>>
>>>>> Greetings all,
>>>>>
>>>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>>>
>>>>> I have a mysql resource, cloned, that is behaving the way I wanted.
>>>>> When the node it is on is unplugged from the network quorum is lost
>>>>> and the mysqld service stops. Great. Oh, and fencing is disabled.
>>>>>
>>>>> When the network connectivity is restored I'd like it to restart but
>>>>> it doesn't. What needs to be done to make this happen automatically?
>>>>> Or what section of the doc should I reread more thoroughly?
>>>>>
>>>>> When mysql is stopped because of the above, if I run "pcs resource
>>>>> refresh" it starts? Any ideas why the "refresh" would do that?
>>>>>
>>>>
>>>> You provided zero information about your setup and how you configured
>>>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>>>> hard to guess.
>>>>
>>>> Logs covering period when you unplug network, and later plug again,
>>>> could be also helpful.
>>>>
>>>
>>> Fair point. I didn't want to put too much into the first email. There
>>> are 3 nodes but 2 nodes are actually used for processing and the 3rd
>>> node is there just for quorum purposes. When quorum is lost my
>>> resources stop. There are 3 resources: a VIP, MySQL service, and
>>> controld (a project specific service).
>>>
>>> And this problem has now become intermittent as 1 in 4 tests this
>>> morning succeeded in starting mysqld when the network was reconnected.
>>> Figures :-/
>>>
>>> More info. After reconnecting the network on spm238 the mysql resource
>>> was listed as:
>>>   * spmDB (systemd:mysqld): FAILED spm238 (blocked)
>>>
>>> This was cleared and mysqld started after issuing a "pcs resource
>>> refresh".
>>>
>>
>> pcs resource refresh deletes failure history so pacemaker tries to start
>> resource again. It is completely unrelated to network interface
>> conditions.
>>
>> "blocked" is default when resource stop operation fails and stonith is
>> disabled.
>>
>>> So as requested here's how I set up my cluster. It's copied from an
>>> ansible playbook so there are some variables shown but should be easy
>>> enough to understand. If not, I will gladly clarify anything.
>>>
>>> My 3 resources:
>>>
>>> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }} cidr_netmask=24 op monitor interval=10s
>>> pcs resource create spmControl systemd:controld op monitor interval=10s
>>> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
>>>
>>> My constraints:
>>> pcs constraint colocation add spmControl with spmVIP INFINITY
>>> pcs constraint colocation add spmVIP with spmDB-clone 200
>>> crm_resource -r spmVIP -p resource-stickiness -m -v 100
>>> crm_resource -r spmControl -p resource-stickiness -m -v 100
>>>
>>> Don't run resources on the quorum only node:
>>> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
>>> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
>>> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
>>>
>>
>> I have no idea what QOnlynode means here.
>>
>
> This is the quorum-only node of mine. Resources are not run on it and the
> 3 constraints above are how I configured that.
>
>>> and stonith is false:
>>> pcs property set stonith-enabled=false
>>>
>>
>> I do not see anything in your configuration that would cause mysql to be
>> stopped on network connectivity issues. Either mysql does it on its own,
>> or pacemaker attempts to stop all resources on node when it goes out of
>> quorum.
>>
>> If mysql does it on its own, there is nothing that can be done from
>> pacemaker side. Pacemaker is not aware of network state at all and
>> certainly cannot initiate actions when network becomes available.
>>
>> If pacemaker tries to stop resources due to out of quorum condition, you
>> could set suitable failure-timeout; this will be equivalent to using
>> "pcs resource refresh". Keep in mind that pacemaker only checks for
>> failure-timeout expiration every cluster-recheck-interval (15 minutes by
>> default). This still is not directly related to network availability,
>> but if network outage resulted in node going out of quorum, when network
>> is back and node joined cluster again it will allow resources to be
>> started on node.
>>
>
> When quorum is lost I want all the resources to stop. The cluster is
> performing this step correctly for me.
>
> That cluster-recheck-interval would explain the intermittence I saw this
> morning. If I set that to 1 minute would that cause any gross negative
> issues?
>

I tried setting cluster-recheck-interval to 1 minute and I saw no change to the resources after reconnecting the network. They were still listed as However, "pcs resource refresh" started it, as usual in this scenario.

Anyone have any other ideas for a configuration setting that will
Re: [ClusterLabs] resource start after network reconnected
> On 19.11.2021 17:36, john tillman wrote:
>>> On 18.11.2021 22:33, john tillman wrote:
>>>>
>>>> Greetings all,
>>>>
>>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>>
>>>> I have a mysql resource, cloned, that is behaving the way I wanted.
>>>> When the node it is on is unplugged from the network quorum is lost
>>>> and the mysqld service stops. Great. Oh, and fencing is disabled.
>>>>
>>>> When the network connectivity is restored I'd like it to restart but
>>>> it doesn't. What needs to be done to make this happen automatically?
>>>> Or what section of the doc should I reread more thoroughly?
>>>>
>>>> When mysql is stopped because of the above, if I run "pcs resource
>>>> refresh" it starts? Any ideas why the "refresh" would do that?
>>>>
>>>
>>> You provided zero information about your setup and how you configured
>>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>>> hard to guess.
>>>
>>> Logs covering period when you unplug network, and later plug again,
>>> could be also helpful.
>>>
>>
>> Fair point. I didn't want to put too much into the first email. There
>> are 3 nodes but 2 nodes are actually used for processing and the 3rd
>> node is there just for quorum purposes. When quorum is lost my resources
>> stop. There are 3 resources: a VIP, MySQL service, and controld (a
>> project specific service).
>>
>> And this problem has now become intermittent as 1 in 4 tests this
>> morning succeeded in starting mysqld when the network was reconnected.
>> Figures :-/
>>
>> More info. After reconnecting the network on spm238 the mysql resource
>> was listed as:
>>   * spmDB (systemd:mysqld): FAILED spm238 (blocked)
>>
>> This was cleared and mysqld started after issuing a "pcs resource
>> refresh".
>>
>
> pcs resource refresh deletes failure history so pacemaker tries to start
> resource again. It is completely unrelated to network interface
> conditions.
>
> "blocked" is default when resource stop operation fails and stonith is
> disabled.
>
>> So as requested here's how I set up my cluster. It's copied from an
>> ansible playbook so there are some variables shown but should be easy
>> enough to understand. If not, I will gladly clarify anything.
>>
>> My 3 resources:
>>
>> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }} cidr_netmask=24 op monitor interval=10s
>> pcs resource create spmControl systemd:controld op monitor interval=10s
>> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
>>
>> My constraints:
>> pcs constraint colocation add spmControl with spmVIP INFINITY
>> pcs constraint colocation add spmVIP with spmDB-clone 200
>> crm_resource -r spmVIP -p resource-stickiness -m -v 100
>> crm_resource -r spmControl -p resource-stickiness -m -v 100
>>
>> Don't run resources on the quorum only node:
>> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
>> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
>> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
>>
>
> I have no idea what QOnlynode means here.
>

This is the quorum-only node of mine. Resources are not run on it and the 3 constraints above are how I configured that.

>> and stonith is false:
>> pcs property set stonith-enabled=false
>>
>
> I do not see anything in your configuration that would cause mysql to be
> stopped on network connectivity issues. Either mysql does it on its own,
> or pacemaker attempts to stop all resources on node when it goes out of
> quorum.
>
> If mysql does it on its own, there is nothing that can be done from
> pacemaker side. Pacemaker is not aware of network state at all and
> certainly cannot initiate actions when network becomes available.
>
> If pacemaker tries to stop resources due to out of quorum condition, you
> could set suitable failure-timeout; this will be equivalent to using "pcs
> resource refresh". Keep in mind that pacemaker only checks for
> failure-timeout expiration every cluster-recheck-interval (15 minutes by
> default). This still is not directly related to network availability, but
> if network outage resulted in node going out of quorum, when network is
> back and node joined cluster again it will allow resources to be started
> on node.
>

When quorum is lost I want all the resources to stop. The cluster is performing this step correctly for me.

That cluster-recheck-interval would explain the intermittence I saw this morning. If I set that to 1 minute would that cause any gross negative issues?

Is there another setting besides cluster-recheck-interval to consider adjusting to start mysql when quorum is returned?

Thank you for the feedback.

-John

>> If you'd rather see the cib file I can supply that.
>>
>> With respect to logs, pacemaker.log has the most relevant info, right,
>> but there's a lot. It's 900+ lines from the time I unplug the network
>> until mysql is restarted by the 'pcs resource refresh'. Any
Re: [ClusterLabs] resource start after network reconnected
On 19.11.2021 17:36, john tillman wrote:
>> On 18.11.2021 22:33, john tillman wrote:
>>>
>>> Greetings all,
>>>
>>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>>
>>> I have a mysql resource, cloned, that is behaving the way I wanted.
>>> When the node it is on is unplugged from the network quorum is lost
>>> and the mysqld service stops. Great. Oh, and fencing is disabled.
>>>
>>> When the network connectivity is restored I'd like it to restart but it
>>> doesn't. What needs to be done to make this happen automatically? Or
>>> what section of the doc should reread more thoroughly?
>>>
>>> When mysql is stopped because of the above, if I run "pcs resource
>>> refresh" it starts? Any ideas why the "refresh" would do that?
>>>
>>
>> You provided zero information about your setup and how you configured
>> pacemaker to stop mysqld on network connectivity loss, so it is rather
>> hard to guess.
>>
>> Logs covering period when you unplug network, and later plug again, could
>> be also helpful.
>>
>
> Fair point. I didn't want to put too much into the first email. There
> are 3 nodes but 2 nodes are actually used for processing and the 3rd node
> is there just for quorum purposes. When quorum is lost my resources stop.
> There are 3 resources: a VIP, MySQL service, and controld (a project
> specific service).
>
> And this problem has now become intermittent as 1 in 4 tests this morning
> succeeded in starting mysqld when the network was reconnected. Figures
> :-/
>
> More info. After reconnecting the network on spm238 the mysql resource
> was listed as:
>     * spmDB (systemd:mysqld): FAILED spm238 (blocked)
>
> This was cleared and mysqld started after issuing a "pcs resource refresh".
>

pcs resource refresh deletes failure history so pacemaker tries to start
resource again. It is completely unrelated to network interface
conditions. "blocked" is default when resource stop operation fails and
stonith is disabled.

> So as requested here's how I setup my cluster. It's copied from an
> ansible playbook so there are some variables shown but should be easy
> enough to understand. If not, I will gladly clarify anything.
>
> My 3 resources:
>
> pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }} cidr_netmask=24 op monitor interval=10s
> pcs resource create spmControl systemd:controld op monitor interval=10s
> pcs resource create spmDB systemd:mysqld op monitor interval=10s clone
>
> My constraints:
> pcs constraint colocation add spmControl with spmVIP INFINITY
> pcs constraint colocation add spmVIP with spmDB-clone 200
> crm_resource -r spmVIP -p resource-stickiness -m -v 100
> crm_resource -r spmControl -p resource-stickiness -m -v 100
>
> Don't run resources on the quorum only node:
> pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
> pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
> pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY
>

I have no idea what QOnlynode means here.

> and stonith is false:
> pcs property set stonith-enabled=false
>

I do not see anything in your configuration that would cause mysql to be
stopped on network connectivity issues. Either mysql does it on its own,
or pacemaker attempts to stop all resources on node when it goes out of
quorum.

If mysql does it on its own, there is nothing that can be done from
pacemaker side. Pacemaker is not aware of network state at all and
certainly cannot initiate actions when network becomes available.

If pacemaker tries to stop resources due to out of quorum condition, you
could set suitable failure-timeout; this will be equivalent to using "pcs
resource refresh". Keep in mind that pacemaker only checks for
failure-timeout expiration every cluster-recheck-interval (15 minutes by
default). This still is not directly related to network availability, but
if network outage resulted in node going out of quorum, when network is
back and node joined cluster again it will allow resources to be started
on node.

> If you'd rather see the cib file I can supply that.
>
> With respect to logs, pacemaker.log has the most relevant info, right, but
> there's a lot. It's 900+ lines from the time I unplug the network until
> mysql is restarted by the 'pcs resource refresh'. Any suggestions for how
> to present the info here? Maybe use grep for some key words and include
> those lines here?
>
>>> It is definitely that call to refresh that triggers the start because
>>> I've run a handful of tests and the time between reconnecting the
>>> network and pcs resource refresh call varied by as much as 10 minutes.
>>>
>>> Any suggestion would be appreciated.
>>>
>>> Regards,
>>> -John
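As a rough sketch of the failure-timeout suggestion above, assuming the resource name spmDB from the quoted configuration: the function name, the PCS variable, and the 60s / 1min values are illustrative assumptions, not settings recommended anywhere in this thread. The function only wraps two standard pcs subcommands so they can be dry-run with PCS=echo. Whether an expired failure also clears a resource shown as "blocked" is not confirmed in the thread and would need testing.

```shell
#!/bin/sh
# Sketch only: names and values below are assumptions for illustration.
# PCS can be overridden (e.g. PCS=echo) to print the commands instead of
# running them against a live cluster.
PCS=${PCS:-pcs}

apply_recovery_settings() {
    failure_timeout=${1:-60s}    # assumed value: how long a failure is remembered
    recheck_interval=${2:-1min}  # assumed value: the default is 15 minutes

    # Let recorded failures of the mysqld resource expire on their own.
    $PCS resource update spmDB meta failure-timeout="$failure_timeout"

    # Have pacemaker re-check for expired failures more often than the
    # 15-minute default.
    $PCS property set cluster-recheck-interval="$recheck_interval"
}

# Usage on a cluster node (as root):
#   apply_recovery_settings 60s 1min
```

Running it first with PCS=echo shows exactly which commands would be issued before anything touches the cluster.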
Re: [ClusterLabs] resource start after network reconnected
> On 18.11.2021 22:33, john tillman wrote:
>>
>> Greetings all,
>>
>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>
>> I have a mysql resource, cloned, that is behaving the way I wanted.
>> When the node it is on is unplugged from the network quorum is lost
>> and the mysqld service stops. Great. Oh, and fencing is disabled.
>>
>> When the network connectivity is restored I'd like it to restart but it
>> doesn't. What needs to be done to make this happen automatically? Or
>> what section of the doc should reread more thoroughly?
>>
>> When mysql is stopped because of the above, if I run "pcs resource
>> refresh" it starts? Any ideas why the "refresh" would do that?
>>
>
> You provided zero information about your setup and how you configured
> pacemaker to stop mysqld on network connectivity loss, so it is rather
> hard to guess.
>
> Logs covering period when you unplug network, and later plug again, could
> be also helpful.
>

Fair point. I didn't want to put too much into the first email. There
are 3 nodes but 2 nodes are actually used for processing and the 3rd node
is there just for quorum purposes. When quorum is lost my resources stop.
There are 3 resources: a VIP, MySQL service, and controld (a project
specific service).

And this problem has now become intermittent as 1 in 4 tests this morning
succeeded in starting mysqld when the network was reconnected. Figures
:-/

More info. After reconnecting the network on spm238 the mysql resource
was listed as:
    * spmDB (systemd:mysqld): FAILED spm238 (blocked)

This was cleared and mysqld started after issuing a "pcs resource refresh".

So as requested here's how I setup my cluster. It's copied from an
ansible playbook so there are some variables shown but should be easy
enough to understand. If not, I will gladly clarify anything.

My 3 resources:

pcs resource create spmVIP ocf:heartbeat:IPaddr2 ip={{ spmvip }} cidr_netmask=24 op monitor interval=10s
pcs resource create spmControl systemd:controld op monitor interval=10s
pcs resource create spmDB systemd:mysqld op monitor interval=10s clone

My constraints:
pcs constraint colocation add spmControl with spmVIP INFINITY
pcs constraint colocation add spmVIP with spmDB-clone 200
crm_resource -r spmVIP -p resource-stickiness -m -v 100
crm_resource -r spmControl -p resource-stickiness -m -v 100

Don't run resources on the quorum only node:
pcs constraint location spmVIP avoids {{ QOnlynode }}=INFINITY
pcs constraint location spmControl avoids {{ QOnlynode }}=INFINITY
pcs constraint location spmDB-clone avoids {{ QOnlynode }}=INFINITY

and stonith is false:
pcs property set stonith-enabled=false

If you'd rather see the cib file I can supply that.

With respect to logs, pacemaker.log has the most relevant info, right, but
there's a lot. It's 900+ lines from the time I unplug the network until
mysql is restarted by the 'pcs resource refresh'. Any suggestions for how
to present the info here? Maybe use grep for some key words and include
those lines here?

>> It is definitely that call to refresh that triggers the start because
>> I've run a handful of tests and the time between reconnecting the
>> network and pcs resource refresh call varied by as much as 10 minutes.
>>
>> Any suggestion would be appreciated.
>>
>> Regards,
>> -John
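One way to trim the 900+ lines, sketched under assumptions: filter the log for the resource name and a handful of keywords. The function name, the keyword list, and the default path /var/log/pacemaker/pacemaker.log are guesses for illustration and would need adjusting to the actual installation.

```shell
#!/bin/sh
# Sketch only: the keyword list and default log path are assumptions.
filter_pacemaker_log() {
    logfile=${1:-/var/log/pacemaker/pacemaker.log}
    # -E: extended regex, -i: case-insensitive match on any of the keywords
    grep -E -i 'spmDB|quorum|error|warning' "$logfile"
}

# Usage:
#   filter_pacemaker_log /var/log/pacemaker/pacemaker.log > excerpt.txt
```

Narrowing the input to the test window first (from unplugging the network to the refresh) keeps the excerpt short enough to post inline.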
Re: [ClusterLabs] resource start after network reconnected
> On Thu, Nov 18, 2021 at 03:42:48PM -0500, john tillman wrote:
>> I don't believe I can since I do not have a fencing device available.
>
> As this page explains, fencing is required for the cluster to behave
> correctly:
>
> https://www.alteeve.com/w/The_2-Node_Myth
>
> Can you share what kind of nodes are you working with? Perhaps some
> simple form of fencing is possible.

Thank you, Valentin. I'll add more information in my next response.

> --
> Valentin
Re: [ClusterLabs] resource start after network reconnected
On 18.11.2021 22:33, john tillman wrote:
>
> Greetings all,
>
> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>
> I have a mysql resource, cloned, that is behaving the way I wanted. When
> the node it is on is unplugged from the network quorum is lost and the
> mysqld service stops. Great. Oh, and fencing is disabled.
>
> When the network connectivity is restored I'd like it to restart but it
> doesn't. What needs to be done to make this happen automatically? Or
> what section of the doc should reread more thoroughly?
>
> When mysql is stopped because of the above, if I run "pcs resource
> refresh" it starts? Any ideas why the "refresh" would do that?
>

You provided zero information about your setup and how you configured
pacemaker to stop mysqld on network connectivity loss, so it is rather
hard to guess.

Logs covering period when you unplug network, and later plug again, could
be also helpful.

> It is definitely that call to refresh that triggers the start because I've
> run a handful of tests and the time between reconnecting the network and
> pcs resource refresh call varied by as much as 10 minutes.
>
> Any suggestion would be appreciated.
>
> Regards,
> -John
Re: [ClusterLabs] resource start after network reconnected
On Thu, Nov 18, 2021 at 03:42:48PM -0500, john tillman wrote:
> I don't believe I can since I do not have a fencing device available.

As this page explains, fencing is required for the cluster to behave
correctly:

https://www.alteeve.com/w/The_2-Node_Myth

Can you share what kind of nodes are you working with? Perhaps some
simple form of fencing is possible.

--
Valentin
Re: [ClusterLabs] resource start after network reconnected
> On Thu, Nov 18, 2021 at 02:33:28PM -0500, john tillman wrote:
>> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>>
>> I have a mysql resource, cloned, that is behaving the way I wanted.
>> When the node it is on is unplugged from the network quorum is lost
>> and the mysqld service stops. Great. Oh, and fencing is disabled.
>
> Can you test how it behaves with fencing enabled?

I don't believe I can since I do not have a fencing device available.

>
> --
> Valentin
Re: [ClusterLabs] resource start after network reconnected
On Thu, Nov 18, 2021 at 02:33:28PM -0500, john tillman wrote:
> preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5
>
> I have a mysql resource, cloned, that is behaving the way I wanted. When
> the node it is on is unplugged from the network quorum is lost and the
> mysqld service stops. Great. Oh, and fencing is disabled.

Can you test how it behaves with fencing enabled?

--
Valentin
[ClusterLabs] resource start after network reconnected
Greetings all,

preamble: RHEL8, PCS 0.10.8, COROSYNC 3.1.0, PACEMAKER 2.0.5

I have a mysql resource, cloned, that is behaving the way I wanted. When
the node it is on is unplugged from the network quorum is lost and the
mysqld service stops. Great. Oh, and fencing is disabled.

When the network connectivity is restored I'd like it to restart but it
doesn't. What needs to be done to make this happen automatically? Or
what section of the doc should reread more thoroughly?

When mysql is stopped because of the above, if I run "pcs resource
refresh" it starts? Any ideas why the "refresh" would do that?

It is definitely that call to refresh that triggers the start because I've
run a handful of tests and the time between reconnecting the network and
pcs resource refresh call varied by as much as 10 minutes.

Any suggestion would be appreciated.

Regards,
-John