Ken Gaillot <kgail...@redhat.com> wrote:
On Thu, 2017-11-30 at 11:58 +0000, Adam Spiers wrote:
Ken Gaillot <kgail...@redhat.com> wrote:
On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:
[snipped]
Let's suppose further that the cluster configuration is such
that no stateful resources which could potentially conflict
with other nodes will ever get launched on that 5th node. For
example it might only host stateless clones, or resources with
requires=nothing set, or it might not even host any resources at
all due to some temporary constraints which have been applied.
In those cases, what is to be gained from fencing? The only
thing I can think of is that using (say) IPMI to power-cycle
the node *might* fix whatever issue was preventing it from
joining the cluster. Are there any other reasons for fencing
in this case? It wouldn't help avoid any data corruption, at
least.
Just because constraints are telling the node it can't run a
resource doesn't mean the node isn't malfunctioning and running
it anyway. If the node can't tell us it's OK, we have to assume
it's not.
Sure, but even if it *is* running it, if it's not conflicting with
anything or doing any harm, is it really always better to fence
regardless?
There's a resource meta-attribute "requires" that says what a resource
needs to start. If it can't do any harm if it runs awry, you can set
requires="quorum" (or even "nothing").
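For example, a sketch of setting that meta-attribute (the resource name "dummy" is a placeholder):

```shell
# Tell Pacemaker this resource only needs quorum, not fencing, to start:
pcs resource meta dummy requires=quorum

# Equivalent lower-level form using crm_resource directly:
crm_resource --resource dummy --meta \
             --set-parameter requires --parameter-value quorum
```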
So, that's sort of a way to let the cluster know that, but it doesn't
currently do what you're suggesting, since start-up fencing is purely
about the node and not about the resources. I suppose if the cluster
had no resources requiring fencing (or, to push it further, no such
resources that will be probed on that node), we could disable start-up
fencing, but that's not done currently.
Yeah, that's the kind of thing I was envisaging.
Disclaimer: to a certain extent I'm playing devil's advocate here to
stimulate a closer (re-)examination of the axiom we've grown so used
to over the years that if we don't know what a node is doing, we
should fence it. I'm not necessarily arguing that fencing is wrong
here, but I think it's healthy to occasionally go back to first
principles and re-question why we are doing things a certain way, to
make sure that the original assumptions still hold true. I'm
familiar with the pain that our customers experience when nodes are
fenced for less than very compelling reasons, so I think it's worth
looking for opportunities to reduce fencing to when it's really
needed.
The fundamental purpose of a high-availability cluster is to keep the
desired service functioning, above all other priorities (including,
unfortunately, making sysadmins' lives easier).
If a service requires an HA cluster, it's a safe bet it will have
problems in a split-brain situation (otherwise, why bother with the
overhead). Even something as simple as an IP address will render a
service useless if it's brought up on two machines on a network.
Fencing is really the only hammer we have in that situation. At that
point, we have zero information about what the node is doing. If it's
powered off (or cut off from disk/network), we know it's not doing
anything.
Fencing may not always help the situation, but it's all we've got.
Sure, but I'm not (necessarily) even talking about a split-brain
situation. For example what if a cluster with remote nodes is shut
down cleanly, and then all the core nodes boot up cleanly but none of
the remote nodes are powered on till hours or even days later?
If I understand Yan correctly, in this situation all the remotes will
be marked as needing fencing, and this is the bit that doesn't make
sense to me. If Pacemaker can't reach *any* remotes, it can't start
any resources on those remotes, so (in the case where resources are
partitioned cleanly into those which run on remotes vs. those which
don't) there is no danger of any concurrency violation. So fencing
remotes before you can use them is totally pointless. Surely fencing
of node A should only happen when Pacemaker is ready to start, on some
node B, a resource X which might already be running on node A. But if
no such node B exists, then fencing is overkill. It would be better to
wait until the first remote joins the cluster, at which point
Pacemaker can assess its current state and decide the best course of
action. Otherwise it's like cutting off your nose to spite your face.
In fact, in the particular scenario which caused me to trigger this
whole discussion, I suspect the above also applies even if some
remotes joined the newly booted cluster quickly whilst others still
take hours or days to boot - because in that scenario it is
additionally safe to assume that none of the resources managed on
those remotes by pacemaker_remoted would ever be started by anything
other than pacemaker_remoted, since a) the whole system is configured
automatically in a way which ensures the managed services won't
automatically start at boot via systemd, and b) if someone started
them manually, they would invalidate the warranty on that cluster ;-)
Therefore we know that if a remote node has not yet joined the newly
booted cluster, it can't be running anything which would conflict with
the other remotes.
We give the user a good bit of control over fencing policies: corosync
tuning, stonith-enabled, startup-fencing, no-quorum-policy, requires,
on-fail, and the choice of fence agent. It can be a challenge for a new
user to know all the knobs to turn, but HA is kind of unavoidably
complex.
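For reference, most of those knobs are ordinary cluster properties or resource/operation attributes; a sketch of where each lives (resource names are placeholders):

```shell
# Cluster-wide fencing policy (cluster properties):
pcs property set stonith-enabled=true
pcs property set startup-fencing=true     # the default; "false" is very unsafe
pcs property set no-quorum-policy=stop

# Per-resource: what a resource needs before it may start:
pcs resource meta myresource requires=fencing

# Per-operation: how to react when an operation fails:
pcs resource update myresource op monitor interval=10s on-fail=fence
```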
Indeed. I just haven't figured out how to configure the cluster for
the above scenario yet, so that it doesn't always fence latecomer
remote nodes.
[snipped]
Also, when exactly does the dc-deadtime timer start ticking?
Is it reset to zero after a node is fenced, so that potentially
that node could go into a reboot loop if dc-deadtime is set too
low?
A node's crmd starts the timer at start-up and whenever a new
election starts; the timer is stopped when the DC makes the node a
join offer.
That's surprising - I would have expected it to be the other way
around, i.e. that the timer doesn't run on the node which is joining,
but one of the nodes already in the cluster (e.g. the DC). Otherwise
how can fencing of that node be triggered if the node takes too long
to join?
I don't think it ever reboots though, I think it just starts a new
election.
Maybe we're talking at cross-purposes? By "reboot loop", I was
asking if the node which fails to join could end up getting
endlessly fenced: join timeout -> fenced -> reboots -> join timeout
-> fenced -> ... etc.
startup-fencing and dc-deadtime don't have anything to do with each
other.
There are two separate joins: the node joins at the corosync layer, and
then its crmd joins to the other crmd's at the pacemaker layer. One of
the crmd's is then elected DC.
startup-fencing kicks in if the cluster has quorum and the DC sees no
node status in the CIB for a node. Node status will be recorded in the
CIB once it joins at the corosync layer. So, all nodes have until
quorum is reached, a DC is elected, and the DC invokes the policy
engine, to join at the cluster layer, else they will be shot. (And at
that time, their status is known and recorded as dead.) This only
happens when the cluster first starts, and is the only way to handle
split-brain at start-up.
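The node status the DC consults lives in the status section of the CIB; a sketch of inspecting it:

```shell
# Dump the status section the DC inspects; nodes that have joined at the
# corosync layer each have a <node_state> entry here, and a node with no
# entry is a candidate for start-up fencing:
cibadmin --query --scope status
```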
dc-deadtime is for the DC election. When a node joins an existing
cluster, it expects the existing DC to make it a membership offer (at
the pacemaker layer). If that doesn't happen within dc-deadtime, the
node asks for a new DC election. The idea is that the DC may be having
trouble that hasn't been detected yet. Similarly, whenever a new
election is called, all of the nodes expect a join offer from whichever
node is elected DC, and again they call a new election if that doesn't
happen in dc-deadtime.
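dc-deadtime itself is an ordinary cluster property; a sketch of raising it for a cluster whose DC is slow to make join offers:

```shell
# Give the existing DC up to 60 seconds to make a join offer before a
# joining node calls for a new election (the default is 20s):
crm_attribute --type crm_config --name dc-deadtime --update 60s
```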
Ahah OK thanks, that's super helpful! I don't suppose it's documented
anywhere? I didn't find it in Pacemaker Explained, at least.
[snipped]
I have an uncomfortable feeling that I'm missing something
obvious, probably due to the documentation's warning that "Not
using the default [for startup-fencing] is very unsafe!" Or is
it only unsafe when the node which exceeded dc-deadtime on
startup could potentially be running a stateful resource which
the cluster now wants to restart elsewhere? If that's the
case, would it be possible to optionally limit startup fencing
to when it's really needed?
Thanks for any light you can shed!
There's no automatic mechanism to know that, but if you know
before a particular start that certain nodes are really down and
are staying that way, you can disable start-up fencing in the
configuration on disk, before starting the other nodes, then
re-enable it once everything is back to normal.
Ahah! That's the kind of tip I was looking for, thanks :-) So you
mean by editing the CIB XML directly? Would disabling
startup-fencing manually this way require a concurrent update of the epoch?
You can edit the CIB on disk when the cluster is down, but you have to
go about it carefully.
Rather than edit it directly, you can use
CIB_file=/var/lib/pacemaker/cib/cib.xml when invoking cibadmin (or your
favorite higher-level tool). cibadmin will update the hash that
pacemaker uses to verify the CIB's integrity. Alternatively, you can
remove *everything* in /var/lib/pacemaker/cib except cib.xml, then edit
it directly.
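Putting that together, a sketch of the CIB_file approach (paths are the usual defaults; crm_attribute honours the same CIB_file variable as cibadmin; run this only while the cluster is stopped on that node):

```shell
# Point the Pacemaker tools at the on-disk CIB instead of a live cluster:
export CIB_file=/var/lib/pacemaker/cib/cib.xml

# Disable start-up fencing in the saved configuration; the tool also
# refreshes the integrity hash that Pacemaker checks on start-up:
crm_attribute --type crm_config --name startup-fencing --update false

unset CIB_file
# ... start this node first, then the rest; once everything is back to
# normal, re-enable it against the live cluster:
crm_attribute --type crm_config --name startup-fencing --update true
```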
Updating the admin epoch is a good idea if you want to be sure your
edited CIB wins, although starting that node first is also good enough.
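The admin epoch can be bumped the same offline way; a sketch (the value just needs to be higher than every other node's):

```shell
# Raise admin_epoch in the on-disk CIB so this copy of the configuration
# wins any conflict when the cluster reforms:
CIB_file=/var/lib/pacemaker/cib/cib.xml \
    cibadmin --modify --xml-text '<cib admin_epoch="42"/>'
```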
Again, great info which deserves to be documented if it isn't already ;-)
Thanks a lot for the really helpful replies!
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org