Ken Gaillot <kgail...@redhat.com> wrote:
On Thu, 2017-11-30 at 11:58 +0000, Adam Spiers wrote:
Ken Gaillot <kgail...@redhat.com> wrote:
On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:

[snipped]

Let's suppose further that the cluster configuration is such that no stateful resources which could potentially conflict with other nodes will ever get launched on that 5th node. For example it might only host stateless clones, or resources with requires=nothing set, or it might not even host any resources at all due to some temporary constraints which have been applied. In those cases, what is to be gained from fencing? The only thing I can think of is that using (say) IPMI to power-cycle the node *might* fix whatever issue was preventing it from joining the cluster. Are there any other reasons for fencing in this case? It wouldn't help avoid any data corruption, at least.

Just because constraints are telling the node it can't run a resource doesn't mean the node isn't malfunctioning and running it anyway. If the node can't tell us it's OK, we have to assume it's not.

Sure, but even if it *is* running it, if it's not conflicting with anything or doing any harm, is it really always better to fence regardless?

There's a resource meta-attribute "requires" that says what a resource needs in order to start. If a resource can't do any harm when it runs awry, you can set requires="quorum" (or even "nothing"). So that's sort of a way to let the cluster know, but it doesn't currently do what you're suggesting, since start-up fencing is purely about the node and not about the resources. I suppose if the cluster had no resources requiring fencing (or, to push it further, no such resources that will be probed on that node), we could disable start-up fencing, but that's not done currently.
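(To make that concrete, a rough sketch of what setting it looks like; "web-clone" is a made-up resource name, and the same thing can equally be expressed as an nvpair in the resource's meta_attributes in the CIB XML:)

    # pcs syntax; crmsh has an equivalent "resource meta" subcommand
    pcs resource meta web-clone requires=quorum

(Valid values are "nothing", "quorum", "fencing" and "unfencing"; the default depends on the resource type and whether stonith-enabled is set.)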

Yeah, that's the kind of thing I was envisaging.
Disclaimer: to a certain extent I'm playing devil's advocate here to stimulate a closer (re-)examination of the axiom we've grown so used to over the years that if we don't know what a node is doing, we should fence it. I'm not necessarily arguing that fencing is wrong here, but I think it's healthy to occasionally go back to first principles and re-question why we are doing things a certain way, to make sure that the original assumptions still hold true. I'm familiar with the pain that our customers experience when nodes are fenced for less than very compelling reasons, so I think it's worth looking for opportunities to reduce fencing to when it's really needed.

The fundamental purpose of a high-availability cluster is to keep the desired service functioning, above all other priorities (including, unfortunately, making sysadmins' lives easier). If a service requires an HA cluster, it's a safe bet it will have problems in a split-brain situation (otherwise, why bother with the overhead). Even something as simple as an IP address will render a service useless if it's brought up on two machines on a network. Fencing is really the only hammer we have in that situation. At that point, we have zero information about what the node is doing. If it's powered off (or cut off from disk/network), we know it's not doing anything.

Fencing may not always help the situation, but it's all we've got.

Sure, but I'm not (necessarily) even talking about a split-brain situation. For example, what if a cluster with remote nodes is shut down cleanly, and then all the core nodes boot up cleanly but none of the remote nodes are powered on until hours or even days later? If I understand Yan correctly, in this situation all the remotes will be marked as needing fencing, and this is the bit that doesn't make sense to me.

If Pacemaker can't reach *any* remotes, it can't start any resources on those remotes, so (in the case where resources are partitioned cleanly into those which run on remotes vs. those which don't) there is no danger of any concurrency violation. So fencing remotes before you can use them is totally pointless. Surely fencing of node A should only happen when Pacemaker is ready to start resource X on node B, where X might already be running on node A. But if no such node B exists then fencing is overkill. It would be better to wait until the first remote joins the cluster, at which point Pacemaker can assess its current state and decide the best course of action. Otherwise it's like cutting off your nose to spite your face.

In fact, in the particular scenario which caused me to trigger this whole discussion, I suspect the above also applies even if some remotes join the newly booted cluster quickly whilst others still take hours or days to boot, because in that scenario it is additionally safe to assume that none of the resources managed on those remotes by pacemaker_remoted would ever be started by anything other than pacemaker_remoted, since a) the whole system is configured automatically in a way which ensures the managed services won't automatically start at boot via systemd, and b) if someone started them manually, they would invalidate the warranty on that cluster ;-) Therefore we know that if a remote node has not yet joined the newly booted cluster, it can't be running anything which would conflict with the other remotes.
We give the user a good bit of control over fencing policies: corosync tuning, stonith-enabled, startup-fencing, no-quorum-policy, requires, on-fail, and the choice of fence agent. It can be a challenge for a new user to know all the knobs to turn, but HA is kind of unavoidably complex.
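(Most of those are ordinary cluster properties or resource options, so the knob-turning itself is fairly uniform; a rough example with pcs, with values chosen purely for illustration and "my-db" a made-up resource name:)

    pcs property set stonith-enabled=true
    pcs property set startup-fencing=true
    pcs property set no-quorum-policy=stop
    pcs property set dc-deadtime=20s
    pcs resource meta my-db requires=fencing

(Corosync tuning lives in corosync.conf rather than the CIB, and the fence agent is chosen when you define the stonith resources.)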

Indeed. I just haven't figured out how to configure the cluster for the above scenario yet, so that it doesn't always fence latecomer remote nodes.
[snipped]

Also, when exactly does the dc-deadtime timer start ticking? Is it reset to zero after a node is fenced, so that potentially that node could go into a reboot loop if dc-deadtime is set too low?

A node's crmd starts the timer at start-up and whenever a new election starts; the timer is stopped when the DC makes that node a join offer.

That's surprising - I would have expected it to be the other way around, i.e. that the timer doesn't run on the node which is joining, but on one of the nodes already in the cluster (e.g. the DC). Otherwise, how can fencing of that node be triggered if it takes too long to join?

I don't think it ever reboots, though; I think it just starts a new election.

Maybe we're talking at cross-purposes? By "reboot loop", I was asking if the node which fails to join could end up getting endlessly fenced: join timeout -> fenced -> reboots -> join timeout -> fenced -> ... etc.

startup-fencing and dc-deadtime don't have anything to do with each other.

There are two separate joins: the node joins at the corosync layer, and then its crmd joins the other crmds at the pacemaker layer. One of the crmds is then elected DC. startup-fencing kicks in if the cluster has quorum and the DC sees no node status in the CIB for a node. Node status is recorded in the CIB once the node joins at the corosync layer. So each node must join at the cluster layer before quorum is reached, a DC is elected, and the DC invokes the policy engine; otherwise it will be shot. (And at that time, its status is known and recorded as dead.) This only happens when the cluster first starts, and is the only way to handle split-brain at start-up.

dc-deadtime is for the DC election. When a node joins an existing cluster, it expects the existing DC to make it a membership offer (at the pacemaker layer). If that doesn't happen within dc-deadtime, the node asks for a new DC election. The idea is that the DC may be having trouble that hasn't been detected yet. Similarly, whenever a new election is called, all of the nodes expect a join offer from whichever node is elected DC, and again they call a new election if that doesn't happen within dc-deadtime.
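(In case it helps to poke at this: the per-node status the DC looks at lives in the status section of the live CIB, and dc-deadtime is an ordinary cluster property, so, untested but assuming the standard CLI tools, something like:)

    # show the status section, including each node's node_state entry
    cibadmin --query --scope status

    # query and, if needed, raise dc-deadtime (a crm_config property)
    crm_attribute --type crm_config --name dc-deadtime --query
    crm_attribute --type crm_config --name dc-deadtime --update 60s

(60s here is just an example value.)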

Ahah OK thanks, that's super helpful! I don't suppose it's documented anywhere? I didn't find it in Pacemaker Explained, at least.
[snipped]

I have an uncomfortable feeling that I'm missing something obvious, probably due to the documentation's warning that "Not using the default [for startup-fencing] is very unsafe!" Or is it only unsafe when the node which exceeded dc-deadtime on startup could potentially be running a stateful resource which the cluster now wants to restart elsewhere? If that's the case, would it be possible to optionally limit startup fencing to the cases where it's really needed? Thanks for any light you can shed!

There's no automatic mechanism to know that, but if you know before a particular start that certain nodes are really down and are staying that way, you can disable start-up fencing in the configuration on disk, before starting the other nodes, then re-enable it once everything is back to normal.

Ahah! That's the kind of tip I was looking for, thanks :-) So you mean by editing the CIB XML directly? Would disabling startup-fencing manually this way require a concurrent update of the epoch?

You can edit the CIB on disk when the cluster is down, but you have to go about it carefully. Rather than edit it directly, you can use CIB_file=/var/lib/pacemaker/cib/cib.xml when invoking cibadmin (or your favorite higher-level tool). cibadmin will update the hash that pacemaker uses to verify the CIB's integrity. Alternatively, you can remove *everything* in /var/lib/pacemaker/cib except cib.xml, then edit it directly. Updating the admin epoch is a good idea if you want to be sure your edited CIB wins, although starting that node first is also good enough.
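(Roughly, and untested, assuming the CIB_file environment variable that the CLI tools respect; the admin_epoch value is just an example, pick anything higher than the current one:)

    # on one node, with pacemaker stopped
    export CIB_file=/var/lib/pacemaker/cib/cib.xml
    crm_attribute --type crm_config --name startup-fencing --update false
    # optionally bump admin_epoch so this copy of the CIB wins
    cibadmin --modify --xml-text '<cib admin_epoch="42"/>'
    unset CIB_file

(Then start that node first, and once everything is back to normal, flip startup-fencing back to true against the live cluster in the usual way.)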

Again, great info which deserves to be documented if it isn't already ;-)
Thanks a lot for the really helpful replies!
