Ken Gaillot <kgail...@redhat.com> wrote:
On Thu, 2017-11-30 at 11:58 +0000, Adam Spiers wrote:
Ken Gaillot <kgail...@redhat.com> wrote:
On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:

[snipped]

Let's suppose further that the cluster configuration is such that no stateful resources which could potentially conflict with other nodes will ever get launched on that 5th node. For example it might only host stateless clones, or resources with requires=nothing set, or it might not even host any resources at all due to some temporary constraints which have been applied. In those cases, what is to be gained from fencing? The only thing I can think of is that using (say) IPMI to power-cycle the node *might* fix whatever issue was preventing it from joining the cluster. Are there any other reasons for fencing in this case? It wouldn't help avoid any data corruption, at least.

Just because constraints are telling the node it can't run a resource doesn't mean the node isn't malfunctioning and running it anyway. If the node can't tell us it's OK, we have to assume it's not.

Sure, but even if it *is* running it, if it's not conflicting with anything or doing any harm, is it really always better to fence regardless?

There's a resource meta-attribute "requires" that says what a resource needs in order to start. If a resource can't do any harm when it runs awry, you can set requires="quorum" (or even "nothing"). So that's sort of a way to let the cluster know, but it doesn't currently do what you're suggesting, since start-up fencing is purely about the node and not about the resources. I suppose if the cluster had no resources requiring fencing (or, to push it further, no such resources that will be probed on that node), we could disable start-up fencing, but that's not done currently.
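(To make that concrete, a rough sketch of what setting it looks like; "web-clone" is a made-up resource name, and the same thing can equally be expressed as an nvpair in the resource's meta_attributes in the CIB XML:)

    # pcs syntax; crmsh has an equivalent "resource meta" subcommand
    pcs resource meta web-clone requires=quorum

(Valid values are "nothing", "quorum", "fencing" and "unfencing"; the default depends on the resource type and whether stonith-enabled is set.)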

Yeah, that's the kind of thing I was envisaging.
Disclaimer: to a certain extent I'm playing devil's advocate here to stimulate a closer (re-)examination of the axiom we've grown so used to over the years that if we don't know what a node is doing, we should fence it. I'm not necessarily arguing that fencing is wrong here, but I think it's healthy to occasionally go back to first principles and re-question why we are doing things a certain way, to make sure that the original assumptions still hold true. I'm familiar with the pain that our customers experience when nodes are fenced for less than very compelling reasons, so I think it's worth looking for opportunities to reduce fencing to when it's really needed.

The fundamental purpose of a high-availability cluster is to keep the desired service functioning, above all other priorities (including, unfortunately, making sysadmins' lives easier). If a service requires an HA cluster, it's a safe bet it will have problems in a split-brain situation (otherwise, why bother with the overhead). Even something as simple as an IP address will render a service useless if it's brought up on two machines on a network. Fencing is really the only hammer we have in that situation. At that point, we have zero information about what the node is doing. If it's powered off (or cut off from disk/network), we know it's not doing anything.

Fencing may not always help the situation, but it's all we've got.

Sure, but I'm not (necessarily) even talking about a split-brain situation. For example, what if a cluster with remote nodes is shut down cleanly, and then all the core nodes boot up cleanly but none of the remote nodes are powered on until hours or even days later? If I understand Yan correctly, in this situation all the remotes will be marked as needing fencing, and this is the bit that doesn't make sense to me.

If Pacemaker can't reach *any* remotes, it can't start any resources on those remotes, so (in the case where resources are partitioned cleanly into those which run on remotes vs. those which don't) there is no danger of any concurrency violation. So fencing remotes before you can use them is totally pointless. Surely fencing of node A should only happen when Pacemaker is ready to start resource X on node B, where X might already be running on node A. But if no such node B exists then fencing is overkill. It would be better to wait until the first remote joins the cluster, at which point Pacemaker can assess its current state and decide the best course of action. Otherwise it's like cutting off your nose to spite your face.

In fact, in the particular scenario which caused me to trigger this whole discussion, I suspect the above also applies even if some remotes join the newly booted cluster quickly whilst others still take hours or days to boot, because in that scenario it is additionally safe to assume that none of the resources managed on those remotes by pacemaker_remoted would ever be started by anything other than pacemaker_remoted, since a) the whole system is configured automatically in a way which ensures the managed services won't automatically start at boot via systemd, and b) if someone started them manually, they would invalidate the warranty on that cluster ;-) Therefore we know that if a remote node has not yet joined the newly booted cluster, it can't be running anything which would conflict with the other remotes.
We give the user a good bit of control over fencing policies: corosync tuning, stonith-enabled, startup-fencing, no-quorum-policy, requires, on-fail, and the choice of fence agent. It can be a challenge for a new user to know all the knobs to turn, but HA is kind of unavoidably complex.
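(Most of those are ordinary cluster properties or resource options, so the knob-turning itself is fairly uniform; a rough example with pcs, with values chosen purely for illustration and "my-db" a made-up resource name:)

    pcs property set stonith-enabled=true
    pcs property set startup-fencing=true
    pcs property set no-quorum-policy=stop
    pcs property set dc-deadtime=20s
    pcs resource meta my-db requires=fencing

(Corosync tuning lives in corosync.conf rather than the CIB, and the fence agent is chosen when you define the stonith resources.)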

Indeed. I just haven't figured out how to configure the cluster for the above scenario yet, so that it doesn't always fence latecomer remote nodes.
[snipped]

Also, when exactly does the dc-deadtime timer start ticking? Is it reset to zero after a node is fenced, so that potentially that node could go into a reboot loop if dc-deadtime is set too low?

A node's crmd starts the timer at start-up and whenever a new election starts; the timer is stopped when the DC makes that node a join offer.

That's surprising - I would have expected it to be the other way around, i.e. that the timer doesn't run on the node which is joining, but on one of the nodes already in the cluster (e.g. the DC). Otherwise, how can fencing of that node be triggered if it takes too long to join?

I don't think it ever reboots, though; I think it just starts a new election.

Maybe we're talking at cross-purposes? By "reboot loop", I was asking if the node which fails to join could end up getting endlessly fenced: join timeout -> fenced -> reboots -> join timeout -> fenced -> ... etc.

startup-fencing and dc-deadtime don't have anything to do with each other.

There are two separate joins: the node joins at the corosync layer, and then its crmd joins the other crmds at the pacemaker layer. One of the crmds is then elected DC. startup-fencing kicks in if the cluster has quorum and the DC sees no node status in the CIB for a node. Node status is recorded in the CIB once the node joins at the corosync layer. So each node must join at the cluster layer before quorum is reached, a DC is elected, and the DC invokes the policy engine; otherwise it will be shot. (And at that time, its status is known and recorded as dead.) This only happens when the cluster first starts, and is the only way to handle split-brain at start-up.

dc-deadtime is for the DC election. When a node joins an existing cluster, it expects the existing DC to make it a membership offer (at the pacemaker layer). If that doesn't happen within dc-deadtime, the node asks for a new DC election. The idea is that the DC may be having trouble that hasn't been detected yet. Similarly, whenever a new election is called, all of the nodes expect a join offer from whichever node is elected DC, and again they call a new election if that doesn't happen within dc-deadtime.
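(In case it helps to poke at this: the per-node status the DC looks at lives in the status section of the live CIB, and dc-deadtime is an ordinary cluster property, so, untested but assuming the standard CLI tools, something like:)

    # show the status section, including each node's node_state entry
    cibadmin --query --scope status

    # query and, if needed, raise dc-deadtime (a crm_config property)
    crm_attribute --type crm_config --name dc-deadtime --query
    crm_attribute --type crm_config --name dc-deadtime --update 60s

(60s here is just an example value.)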

Ahah OK thanks, that's super helpful! I don't suppose it's documented anywhere? I didn't find it in Pacemaker Explained, at least.
[snipped]

I have an uncomfortable feeling that I'm missing something obvious, probably due to the documentation's warning that "Not using the default [for startup-fencing] is very unsafe!" Or is it only unsafe when the node which exceeded dc-deadtime on startup could potentially be running a stateful resource which the cluster now wants to restart elsewhere? If that's the case, would it be possible to optionally limit startup fencing to the cases where it's really needed? Thanks for any light you can shed!

There's no automatic mechanism to know that, but if you know before a particular start that certain nodes are really down and are staying that way, you can disable start-up fencing in the configuration on disk, before starting the other nodes, then re-enable it once everything is back to normal.

Ahah! That's the kind of tip I was looking for, thanks :-) So you mean by editing the CIB XML directly? Would disabling startup-fencing manually this way require a concurrent update of the epoch?

You can edit the CIB on disk when the cluster is down, but you have to go about it carefully. Rather than edit it directly, you can use CIB_file=/var/lib/pacemaker/cib/cib.xml when invoking cibadmin (or your favorite higher-level tool). cibadmin will update the hash that pacemaker uses to verify the CIB's integrity. Alternatively, you can remove *everything* in /var/lib/pacemaker/cib except cib.xml, then edit it directly. Updating the admin epoch is a good idea if you want to be sure your edited CIB wins, although starting that node first is also good enough.
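(Roughly, and untested, assuming the CIB_file environment variable that the CLI tools respect; the admin_epoch value is just an example, pick anything higher than the current one:)

    # on one node, with pacemaker stopped
    export CIB_file=/var/lib/pacemaker/cib/cib.xml
    crm_attribute --type crm_config --name startup-fencing --update false
    # optionally bump admin_epoch so this copy of the CIB wins
    cibadmin --modify --xml-text '<cib admin_epoch="42"/>'
    unset CIB_file

(Then start that node first, and once everything is back to normal, flip startup-fencing back to true against the live cluster in the usual way.)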

Again, great info which deserves to be documented if it isn't already ;-)
Thanks a lot for the really helpful replies!
