Ken Gaillot <kgail...@redhat.com> wrote:
On Wed, 2017-11-29 at 14:22 +0000, Adam Spiers wrote:
Hi all,

A colleague has been valiantly trying to help me belatedly learn
about
the intricacies of startup fencing, but I'm still not fully
understanding some of the finer points of the behaviour.

The documentation on the "startup-fencing" option[0] says

    Advanced Use Only: Should the cluster shoot unseen nodes? Not
    using the default is very unsafe!

and that it defaults to TRUE, but doesn't elaborate any further:

    https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/s-cluster-options.html

Let's imagine the following scenario:

- We have a 5-node cluster, with all nodes running cleanly.

- The whole cluster is shut down cleanly.

- The whole cluster is then started up again.  (Side question: what
  happens if the last node to shut down is not the first to start up?
  How will the cluster ensure it has the most recent version of the
  CIB?  Without that, how would it know whether the last man standing
  was shut down cleanly or not?)

Of course, the cluster can't know what CIB versions the nodes it
doesn't see have, so if a set of nodes is started with an older
version, it will go with that.

Right, that's what I expected.

However, a node can't do much without quorum, so it would be difficult
to get in a situation where CIB changes were made with quorum before
shutdown, but none of those nodes are present at the next start-up with
quorum.

In any case, when a new node joins a cluster, the nodes do compare CIB
versions. If the new node has a newer CIB, the cluster will use it. If
other changes have been made since then, the newest CIB wins, so one or
the other's changes will be lost.

Ahh, that's interesting.  Based on reading

    https://clusterlabs.org/doc/en-US/Pacemaker/1.1-crmsh/html/Pacemaker_Explained/ch03.html#_cib_properties

whichever node has the highest (admin_epoch, epoch, num_updates) tuple
will win, so normally in this scenario it would be the epoch which
decides it, i.e. whichever node had the most changes since the last
time the conflicting nodes shared the same config - right?

And if that would choose the wrong node, admin_epoch can be set
manually to override that decision?
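(Presumably something along these lines, going by the cibadmin
documentation - the bump value is illustrative:)

    # Show this node's version tuple (first line of the CIB):
    cibadmin --query | head -n 1
    # e.g. <cib admin_epoch="0" epoch="42" num_updates="7" ...>

    # Make this node's configuration win the comparison by raising
    # admin_epoch, which the cluster itself never modifies:
    cibadmin --modify --xml-text '<cib admin_epoch="1"/>'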

Whether missing nodes were shut down cleanly or not relates to your
next question ...

- 4 of the nodes boot up fine and rejoin the cluster within the
  dc-deadtime interval, forming a quorum, but the 5th doesn't.

IIUC, with startup-fencing enabled, this will result in that 5th node
automatically being fenced.  If I'm right, is that really *always*
necessary?

It's always safe. :-) As you mentioned, if the missing node was the
last one alive in the previous run, the cluster can't know whether it
shut down cleanly or not. Even if the node was known to shut down
cleanly in the last run, the cluster still can't know whether the node
was started since then and is now merely unreachable. So, fencing is
necessary to ensure it's not accessing resources.

I get that, but I was questioning the "necessary to ensure it's not
accessing resources" part of this statement.  My point is that
sometimes this might be overkill, because sometimes we might be able to
discern through other methods that there are no resources we need to
worry about potentially conflicting with what we want to run.  That's
why I gave the stateless clones example.

The same scenario is why a single node can't have quorum at start-up in
a cluster with "two_node" set. Both nodes have to see each other at
least once before they can assume it's safe to do anything.

Yep.
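(For anyone following along, two_node is the corosync votequorum
option; a minimal corosync.conf sketch:)

    quorum {
        provider: corosync_votequorum
        two_node: 1
        # two_node implies wait_for_all, hence the "see each other
        # at least once" behaviour Ken describes
    }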

Let's suppose further that the cluster configuration is such that no
stateful resources which could potentially conflict with other nodes
will ever get launched on that 5th node.  For example it might only
host stateless clones, or resources with requires=nothing set, or it
might not even host any resources at all due to some temporary
constraints which have been applied.
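(To illustrate, such a resource might look like this in the CIB - the
names are made up, and in 1.1 "requires" is a per-operation field:)

    <clone id="dummy-clone">
      <primitive id="dummy" class="ocf" provider="pacemaker" type="Dummy">
        <operations>
          <!-- start without waiting for quorum or fencing -->
          <op id="dummy-start" name="start" interval="0"
              requires="nothing"/>
        </operations>
      </primitive>
    </clone>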

In those cases, what is to be gained from fencing?  The only thing I
can think of is that using (say) IPMI to power-cycle the node *might*
fix whatever issue was preventing it from joining the cluster.  Are
there any other reasons for fencing in this case?  It wouldn't help
avoid any data corruption, at least.

Just because constraints are telling the node it can't run a resource
doesn't mean the node isn't malfunctioning and running it anyway. If
the node can't tell us it's OK, we have to assume it's not.

Sure, but even if it *is* running it, if it's not conflicting with
anything or doing any harm, is it really always better to fence
regardless?

Disclaimer: to a certain extent I'm playing devil's advocate here to
stimulate a closer (re-)examination of the axiom we've grown so used
to over the years that if we don't know what a node is doing, we
should fence it.  I'm not necessarily arguing that fencing is wrong
here, but I think it's healthy to occasionally go back to first
principles and re-question why we are doing things a certain way, to
make sure that the original assumptions still hold true.  I'm familiar
with the pain that our customers experience when nodes are fenced for
less than very compelling reasons, so I think it's worth looking for
opportunities to reduce fencing to when it's really needed.

Now let's imagine the same scenario, except rather than a clean full
cluster shutdown, all nodes were affected by a power cut, but also
this time the whole cluster is configured to *only* run stateless
clones, so there is no risk of conflict between two nodes accidentally
running the same resource.  On startup, the 4 nodes in the quorum have
no way of knowing that the 5th node was also affected by the power
cut, so in theory from their perspective it could still be running a
stateless clone.  Again, is there anything to be gained from fencing
the 5th node once it exceeds the dc-deadtime threshold for joining,
other than the chance that a reboot might fix whatever was preventing
it from joining, and get the cluster back to full strength?

If a cluster runs only services that have no potential to conflict,
then you don't need a cluster. :-)

True :-)  Again, as devil's advocate, this scenario could be extended to
include remote nodes which *do* run resources which could conflict
(such as compute nodes), and in that case running stateless clones
(only) on the core cluster could be justified simply on the grounds
that we need Pacemaker for the remotes anyway, so we might as well use
it for the stateless clones rather than introducing keepalived as yet
another component ... but this is starting to get hypothetical, so
it's perhaps not worth spending energy discussing that tangent ;-)

Unique clones require communication even if they're stateless (think
IPaddr2).

Well yeah, IPaddr2 is arguably stateful since there are changing ARP
tables involved :-)

I'm pretty sure even some anonymous stateless clones require
communication to avoid issues.

Fair enough.

Also, when exactly does the dc-deadtime timer start ticking?
Is it reset to zero after a node is fenced, so that potentially that
node could go into a reboot loop if dc-deadtime is set too low?

A node's crmd starts the timer at start-up and whenever a new election
starts; the timer is stopped when the DC makes the node a join offer.

That's surprising - I would have expected it to be the other way
around, i.e. that the timer doesn't run on the node which is joining,
but one of the nodes already in the cluster (e.g. the DC).  Otherwise
how can fencing of that node be triggered if the node takes too long
to join?

I don't think it ever reboots though, I think it just starts a new
election.

Maybe we're talking at cross-purposes?  By "reboot loop", I was asking
if the node which fails to join could end up getting endlessly fenced:
join timeout -> fenced -> reboots -> join timeout -> fenced -> ... etc.

So, you can get into an election loop, but I think network conditions
would have to be pretty severe.

Yeah, that sounds like a different type of loop to the one I was
imagining.
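(Related aside: where nodes are merely slow to boot, raising
dc-deadtime presumably buys headroom - something like this, with an
example value only:)

    # Query the current value:
    crm_attribute --type crm_config --name dc-deadtime --query

    # Raise it cluster-wide:
    crm_attribute --type crm_config --name dc-deadtime --update 2min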

The same questions apply if this troublesome node was actually a
remote node running pacemaker_remoted, rather than the 5th node in the
cluster.

Remote nodes don't join at the crmd level as cluster nodes do, so they
don't "start up" in the same sense

Sure, they establish a TCP connection via pacemaker_remoted when the
remote resource is starting.

and start-up fencing doesn't apply to them.  Instead, the cluster
initiates the connection when called for (I don't remember for sure
whether it fences the remote node if the connection fails, but that
would make sense).

Hrm, that's not what Yan said, and it's not what my L3 colleagues are
reporting either ;-)  I've been told (but not yet verified myself)
that if a remote resource's start operation times out (e.g. due to
the remote node not being up yet), the remote will get fenced.
But I see Yan has already replied with additional details on this.
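(For context, the connection resource whose start can time out is an
ocf:pacemaker:remote primitive, roughly like this - the hostname and
timeout are made up:)

    <primitive id="compute1" class="ocf" provider="pacemaker" type="remote">
      <instance_attributes id="compute1-params">
        <nvpair id="compute1-server" name="server"
                value="compute1.example.com"/>
      </instance_attributes>
      <operations>
        <!-- if this start op times out, recovery kicks in -->
        <op id="compute1-start" name="start" interval="0" timeout="60s"/>
      </operations>
    </primitive>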

I have an uncomfortable feeling that I'm missing something obvious,
probably due to the documentation's warning that "Not using the
default [for startup-fencing] is very unsafe!"  Or is it only
unsafe when the node which exceeded dc-deadtime on startup
could potentially be running a stateful resource which the cluster
now wants to restart elsewhere?  If that's the case, would it be
possible to optionally limit startup fencing to when it's really
needed?

Thanks for any light you can shed!

There's no automatic mechanism to know that, but if you know before a
particular start that certain nodes are really down and are staying
that way, you can disable start-up fencing in the configuration on
disk, before starting the other nodes, then re-enable it once
everything is back to normal.

Ahah!  That's the kind of tip I was looking for, thanks :-)  So you
mean by editing the CIB XML directly?  Would disabling startup-fencing
manually this way require a concurrent update of the epoch?
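(I guess something like the following would do it, assuming the usual
on-disk CIB path and that the tools honour CIB_file as documented -
corrections welcome:)

    # On one node, with the cluster stopped:
    CIB_file=/var/lib/pacemaker/cib/cib.xml \
        crm_attribute --type crm_config --name startup-fencing \
        --update false

    # The tooling bumps the CIB "epoch" on any modification, so no
    # manual epoch change should be needed.  If you edit cib.xml by
    # hand instead, also remove cib.xml.sig so the edit is accepted.

    # Once every node is back and healthy, re-enable it:
    crm_attribute --type crm_config --name startup-fencing --update true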

