On Fri, 2021-11-19 at 14:57 -0500, john tillman wrote:
> > On Fri, 2021-11-19 at 10:40 -0500, john tillman wrote:
> > > > <snip>
> > > >
> > > > If pacemaker tries to stop resources due to out of quorum
> > > > condition, you could set suitable failure-timeout; this will
> > > > be equivalent to using "pcs resource refresh". Keep in mind
> > > > that pacemaker only checks for failure-timeout expiration
> > > > every cluster-recheck-interval (15
> >
> > That's true only for Pacemaker versions less than 2.0.3; since
> > 2.0.3, the cluster rechecks as soon as the timeout hits.
>
> I'm using pacemaker 2.0.5 and it is *not* starting MySQL when quorum
> is restored, at least not every time (~1 in 10). So I have seen it
> work

That's due to a stop failure, not the recheck interval.

> before but I'm more willing to believe that there was a user error
> in that one successful sample.
>
> We (actually a teammate) got mysql to start when quorum is restored.
> It required both setting the cluster-recheck-interval to something
> more frequent than 15min and setting the mysql resource's
> failure-timeout to non-zero. In our case we set both to 1 minute
> with good results for the last few tests. We can raise the frequency
> to something greater than 1 but for our tests, 1 proves it out.

The failure-timeout is equivalent to running refresh when the timeout
hits. The cluster will then re-probe the status of the resource and
decide what, if anything, needs to be done about it. I can only see
that working if the stop failure is transient -- i.e., either the stop
actually succeeded but returned a failure code (or maybe timed out),
and when the failure timeout or refresh happens, the re-probe sees the
database is actually not running; or the stop really does fail, but by
the time the failure timeout or refresh happens, another stop attempt
after the re-probe is able to succeed.

> > > > minutes by default). This still is not directly related to
> > > > network availability, but if network outage resulted in node
> > > > going out of quorum, when network is back and node joined
> > > > cluster again it will allow resources to be started on node.
> > >
> > > When quorum is lost I want all the resources to stop. The
> > > cluster is performing this step correctly for me.
> >
> > As long as it's working properly. If quorum is lost because one of
> > the nodes is malfunctioning -- maybe a device driver locked up the
> > system, or CPU wait is horrific due to an out-of-control process
> > or disk failure -- then that node will not know quorum has been
> > lost and will not stop resources.
> > If the condition then clears up, suddenly you have split-brain
> > with two nodes running resources.
> >
> > > That cluster-recheck-interval would explain the intermittence I
> > > saw this morning. If I set that to 1 minute would that cause any
> > > gross negative issues?
> >
> > It increases CPU usage and IPC traffic. For Pacemaker 2.0.3 or
> > later, I definitely wouldn't bother. For older versions, 1 minute
> > feels a bit much, I would go with around 5.
> >
> > > Is there another setting besides cluster-recheck-interval to
> > > consider adjusting to start mysql when quorum is returned?
> > >
> > > Thank you for the feedback.
> > >
> > > -John
> >
> > --
> > Ken Gaillot <[email protected]>
-- 
Ken Gaillot <[email protected]>

_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
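
P.S. To make the timing discussed in this thread concrete, here is a rough sketch of when an expired failure-timeout gets acted on, before and since Pacemaker 2.0.3. It is a simplified model, not Pacemaker's scheduling code: the function name and parameters are hypothetical, and it assumes that periodic rechecks tick at exact multiples of cluster-recheck-interval and that no other cluster event triggers a recalculation in between.

```python
import math


def failure_expiry_noticed_at(failed_at, failure_timeout,
                              recheck_interval, rechecks_on_expiry):
    """Return the time (in seconds) at which an expired failure is acted on.

    failed_at          -- time the failure was recorded
    failure_timeout    -- the resource's failure-timeout, in seconds
    recheck_interval   -- cluster-recheck-interval, in seconds
    rechecks_on_expiry -- True models 2.0.3 and later, False models older
                          versions (hypothetical flag for this sketch)
    """
    expires_at = failed_at + failure_timeout
    if rechecks_on_expiry:
        # Since 2.0.3: the cluster rechecks as soon as the timeout hits.
        return expires_at
    # Older versions: expiry is only noticed at the next periodic recheck.
    # Assumption: rechecks happen at multiples of recheck_interval from 0.
    return math.ceil(expires_at / recheck_interval) * recheck_interval


# Failure at t=100s, failure-timeout of 60s.
print(failure_expiry_noticed_at(100, 60, 900, True))   # default 15min interval, 2.0.3+
print(failure_expiry_noticed_at(100, 60, 900, False))  # default 15min interval, older
print(failure_expiry_noticed_at(100, 60, 300, False))  # 5min interval, older
```

Under these assumptions, shortening cluster-recheck-interval only changes the result in the older-version case, which is why it is not worth the extra CPU usage and IPC traffic on 2.0.3 or later.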
