Henry Robinson wrote:
Effectively, EC2 does not introduce any new failure modes but potentially
exacerbates some existing ones. If a majority of EC2 nodes fail (in the
sense that their hard drive images cannot be recovered), there is no way to
restart the cluster, and persistence is lost. As you say, this is highly
unlikely. If, for some reason, the quorums are set such that only a single
node failure could bring down the quorum (bad design, but plausible), this
failure is more likely.
This is not strictly true. The cluster cannot recover _automatically_ if
failures > n, where the ensemble size is 2n+1. However, you can recover
manually as long as at least one snapshot and its trailing logs can be
recovered. We can even recover if the latest snapshots are corrupted, as
long as we can recover a snapshot from some previous time t and all logs
subsequent to t.
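To make the replay idea concrete, here is a toy sketch in Python. It is not ZooKeeper's actual on-disk format or recovery tooling; `recover`, the (zxid, state) snapshot pairs, and the (zxid, key, value) log records are all hypothetical, chosen just to show "newest intact snapshot plus every later transaction":

```python
def recover(snapshots, log):
    """Toy manual-recovery sketch (NOT ZooKeeper's real formats).
    snapshots: list of (zxid, state_dict) pairs; a corrupt snapshot
               is represented as (zxid, None).
    log: list of (zxid, key, value) transactions, assumed complete
         after the chosen snapshot's zxid."""
    # Pick the newest snapshot that is still readable.
    intact = [(z, s) for z, s in snapshots if s is not None]
    if not intact:
        raise RuntimeError("no usable snapshot: state is unrecoverable")
    snap_zxid, state = max(intact, key=lambda p: p[0])
    state = dict(state)
    # Replay the trailing logs in zxid order on top of the snapshot.
    for zxid, key, value in sorted(log):
        if zxid > snap_zxid:
            state[key] = value
    return state
```

For example, with the snapshot at zxid 9 corrupted, recovery falls back to the zxid-5 snapshot and replays transactions 6, 8 and 10 to reach the latest state.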
EC2 just ups the stakes - crash failures are now potentially more dangerous
(bugs, packet corruption, rack-local hardware failures, etc. could all cause
crash failures). It is common to assume that, barring a significant
physical event that wipes a number of hard drives, writes that are written
stay written. This assumption is sometimes false given certain choices of
filesystem. EC2 just gives us a few more ways for it not to be true.
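As an aside on "writes that are written stay written": on POSIX systems, surviving a crash generally requires fsyncing both the file and, for renames, its containing directory. Here is a minimal sketch of that pattern (assuming Linux-like fsync semantics; `durable_write` is a made-up helper, not anything from ZooKeeper):

```python
import os

def durable_write(path, data):
    """Write bytes to path so the result survives a crash (sketch;
    assumes POSIX/Linux fsync semantics)."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())        # force file contents to stable storage
    os.rename(tmp, path)            # atomic replace on POSIX
    # Persist the rename itself by fsyncing the directory entry.
    dirfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Skipping either fsync is one of the "certain choices" that can silently break the writes-stay-written assumption.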
I think it's more likely than one might expect to have a lagging minority
left behind - say they are partitioned from the majority by a malfunctioning
switch. They might all be lagging already as a result. Care must be taken
not to bring up another follower on the minority side to make it a majority,
else there are split-brain issues as well as the possibility of lost
transactions. Again, not *too* likely to happen in the wild, but these
permanently running services have a nasty habit of exploring the edge cases.
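The split-brain arithmetic is worth spelling out. A minimal sketch (the `has_quorum` helper is just illustrative, not ZooKeeper's actual leader-election code): a partition can elect a leader only with a strict majority of the configured ensemble, so adding one follower to the minority side of a 3/2 split gives *both* sides a "majority":

```python
def has_quorum(live_members, ensemble_size):
    # A strict majority of the configured ensemble is required
    # to elect a leader.
    return live_members > ensemble_size // 2

ENSEMBLE = 5
majority_side, minority_side = 3, 2

assert has_quorum(majority_side, ENSEMBLE)      # can elect a leader
assert not has_quorum(minority_side, ENSEMBLE)  # safely stuck

# Bring up one extra follower on the partitioned minority side:
# now both sides of the partition believe they hold a quorum.
assert has_quorum(minority_side + 1, ENSEMBLE)  # split brain
```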
To be explicit, you can cause any ZK cluster to back-track in time by doing
f) adding new members to the cluster
Which is why care needs to be taken that the ensemble can't be expanded
without a current quorum. Dynamic membership doesn't save us when a majority
fails - the existence of a quorum is a liveness condition for ZK. To help
with the liveness issue we can sacrifice a little safety (see, e.g.,
vector-clock-ordered timestamps in Dynamo), but I think that ZK is aimed at
safety first, liveness second. Not that you were advocating changing that;
I'm just articulating why correctness is extremely important from my
perspective.
At this point, you will have lost the transactions from (b), but I really,
really am not going to worry about this happening either by plan or by
accident. Without steps (e) and (f), the cluster will tell you that it
knows something is wrong and that it cannot elect a leader. If you don't
have *exact* coincidence of the survivor set and the set of laggards, then
you won't have any data loss at all.
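That last condition can be modeled directly. A toy sketch (the name `committed_data_lost` and the node-id sets are made up for illustration; this slightly generalizes "exact coincidence" to "every survivor is a laggard", which is the underlying condition - if even one up-to-date node survives, the latest committed state is recoverable from it):

```python
def committed_data_lost(survivors, laggards):
    """Toy model over sets of node ids: committed writes are lost
    only when every surviving node lagged the leader, i.e. no
    survivor holds the latest committed transaction."""
    return len(survivors) > 0 and survivors <= laggards

# 5-node ensemble {1..5}; nodes 4 and 5 were lagging.
# Nodes 1-3 fail unrecoverably: only laggards survive -> loss.
assert committed_data_lost({4, 5}, {4, 5})

# If up-to-date node 3 also survives, nothing is lost.
assert not committed_data_lost({3, 4, 5}, {4, 5})
```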
You have to decide if this is too much risk for you. My feeling is that it
is an OK level of correctness for conventional-weapon fire control, but not
for nuclear-weapon safeguards. Since my apps are considerably less sensitive
than either of those, I am not much worried.
On Mon, Jul 6, 2009 at 12:40 PM, Henry Robinson <he...@cloudera.com>
It seems like there is a
correctness issue: if a majority of servers fail, with the remaining
minority lagging the leader for some reason, won't the ensemble's current
state be forever lost?