lge> > I would have expected corosync to come back with a "stable
lge> > non-quorate membership" of just itself within a very short
lge> > period of time, and pacemaker winning the
lge> > "election"/"integration" with just itself, and then trying
lge> > to call "stop" on everything it knows about.
ken>
ken> That's what I'd expect, too. I'm guessing the corosync cycling is
ken> what's causing the pacemaker cycling, so I'd focus on corosync first.
Any Corosync folks around with some input?

What may cause corosync on an isolated (with iptables DROP rules) node to
keep creating a "new membership" with only itself? Is it a problem with the
test setup, maybe? Does an isolated corosync node need to be able to send
the token to itself? Do the "iptables DROP" rules on the outgoing interfaces
prevent that?

On Tue, Jun 01, 2021 at 10:31:21AM -0500, kgail...@redhat.com wrote:
> On Tue, 2021-06-01 at 13:18 +0200, Ulrich Windl wrote:
> > Hi!
> >
> > I can't answer, but I doubt the usefulness of
> > "no-quorum-policy=stop": If nodes lose quorum, they try to
> > stop all resources, but "remain" in the cluster (will respond
> > to network queries, if any arrive). If one of those "stop"s
> > fails, the other part of the cluster never knows. So what can
> > be done? Should the "other(left)" part of the cluster start
> > resources, assuming the "other(right)" part of the cluster had
> > stopped resources successfully?
>
> no-quorum-policy only affects what the non-quorate partition will do.
> The quorate partition will still fence the non-quorate part if it is
> able, regardless of no-quorum-policy, and won't recover resources until
> fencing succeeds.

The context in this case is "fencing by storage".

DRBD 9 has a "drbd quorum" feature, where you can ask it to throw IO errors
(or freeze) if DRBD quorum is lost, so data integrity on network partition
is protected even without fencing on the pacemaker level.

It is rather a "convenience" that the non-quorate pacemaker on the isolated
node should stop everything that still "survived"; in particular, the umount
is necessary for DRBD on that node to become secondary again, which in turn
is necessary for it to re-integrate later once connectivity is restored.

Yes, fencing on the node level is still necessary for other scenarios. But
in certain scenarios it would be nice to avoid a node-level fence while
still avoiding "trouble" once connectivity is restored.

And that would work nicely here, if the corosync membership of the isolated
node were stable enough for pacemaker to finalize "integration" with itself
and then (try to) stop everything, so we have a truly "idle" node when
connectivity is restored.

"trouble": spurious restarts of services ("resource too active ..."),
problems re-connecting DRBD ("two primaries not allowed")

> > > pcmk 2.0.5, corosync 3.1.0, knet, rhel8
> > > I know fencing "solves" this just fine.
> > >
> > > what I'd like to understand though is: what exactly is
> > > corosync or pacemaker waiting for here, why does it not
> > > manage to get to the stage where it would even attempt to
> > > "stop" stuff?
> > >
> > > two "rings" aka knet interfaces.
> > > node isolation test with iptables,
> > > INPUT/OUTPUT -j DROP on one interface,
> > > shortly after on the second as well.
> > > node loses quorum (obviously).
> > >
> > > pacemaker is expected to honor no-quorum-policy=stop,
> > > but is "stuck" in Election -> Integration,
> > > while corosync "cycles" between "new membership" (with only
> > > itself, obviously) and "token has not been received in ...",
> > > "sync members ...", "new membership has formed ..."
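
P.S.: In case the isolation rules themselves turn out to be the culprit, an
(untested) variation of the test would be to drop only traffic to and from
the peers' ring addresses, instead of everything on the knet interfaces, so
that whatever corosync may send to the local node itself is left alone. The
peer addresses below are placeholders for the real ring addresses:

  # sketch only: cut this node off from its peers, but leave loopback
  # and the node's own ring addresses untouched
  for peer in 192.168.122.11 192.168.122.12; do
      iptables -A INPUT  -s "$peer" -j DROP
      iptables -A OUTPUT -d "$peer" -j DROP
  done

If corosync still cycles through "new membership" / "token has not been
received" with rules like these, the test setup can probably be ruled out.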