On Jul 6, 2009, at 15:40, Henry Robinson wrote:
This is an interesting way of doing things. It seems like there is a
correctness issue: if a majority of servers fail, with the remaining
minority lagging the leader for some reason, won't the ensemble's
state be forever lost? This is akin to a majority of servers failing
and never recovering. ZK relies on the eventual liveness of a majority
of servers; with EC2 it seems possible that that property might not hold.
I think you are absolutely correct. However, my understanding of EC2
failure modes is that even though there is no guarantee that a
particular instance's disk will survive a failure, it is entirely possible
to observe EC2 nodes that "fail" temporarily (such as by rebooting). In
those cases the instance's disk typically does survive, and when the node
comes back it will have the same contents. It is only with "permanent" EC2
failures that the disk is gone (e.g. a hardware failure, or Amazon
decides to pull the instance for some other reason).
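
To make the "majority" arithmetic concrete, here is a tiny illustrative
snippet (not anything from ZooKeeper itself) that checks whether a set of
surviving servers can still form a quorum; the ensemble size and survivor
counts are just example numbers:

def quorum_size(ensemble_size):
    # A ZooKeeper ensemble of N servers needs a strict majority
    # (floor(N/2) + 1) of them alive to elect a leader and commit writes.
    return ensemble_size // 2 + 1

def can_make_progress(ensemble_size, survivors):
    return survivors >= quorum_size(ensemble_size)

if __name__ == "__main__":
    # A 5-node ensemble tolerates 2 permanent failures...
    print(can_make_progress(5, 3))   # True
    # ...but loses liveness (and, if the failed disks are gone for good,
    # possibly committed state) once only 2 nodes survive.
    print(can_make_progress(5, 2))   # False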
Thus, this looks a lot like running your own machines in your own data
center to me. Soft failures will recover, hardware failures won't. The
only difference is that if you were running the machines yourself, and
you ran into some weird issue where you had hardware failures across a
majority of your Zookeeper ensemble, you could physically move the
disks to recover the state. If this happens in EC2, you will have to
do some sort of "manual" repair where you forcibly restart Zookeeper
using the state of one of the surviving members. Some Zookeeper
operations may be lost in this case.
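
For what it's worth, here is a rough sketch of what that manual repair
could look like: copy the dataDir from a surviving member, bring it up as
a standalone server, then grow the ensemble back out. The paths are made
up and the zoo.cfg values are only typical defaults, so treat this as an
illustration rather than a recipe:

import shutil
from pathlib import Path

# Hypothetical locations; adjust for your deployment.
SURVIVOR_DATADIR = Path("/var/zookeeper/data")        # snapshot + txn logs of the surviving member
RECOVERY_DATADIR = Path("/var/zookeeper/recovered")   # dataDir for the rebuilt, standalone server
RECOVERY_CFG     = Path("/etc/zookeeper/zoo.cfg")

def build_recovery_config(datadir: Path) -> str:
    # Minimal standalone zoo.cfg: with no server.N lines, ZooKeeper runs
    # as a single node, so no quorum is needed to come back up.
    return (
        "tickTime=2000\n"
        "initLimit=10\n"
        "syncLimit=5\n"
        f"dataDir={datadir}\n"
        "clientPort=2181\n"
    )

def recover_from_survivor() -> None:
    # 1. Copy the surviving member's data directory; its last snapshot and
    #    transaction logs are the most recent state we can recover.
    shutil.copytree(SURVIVOR_DATADIR, RECOVERY_DATADIR)
    # 2. Write a standalone config pointing at the copied data.
    RECOVERY_CFG.write_text(build_recovery_config(RECOVERY_DATADIR))
    # 3. Start ZooKeeper against this config, verify the data looks sane,
    #    then add the other servers back with a new ensemble config. Any
    #    writes committed by the lost majority that this survivor never
    #    saw are gone for good.

if __name__ == "__main__":
    recover_from_survivor()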
However, we are talking about a situation that seems exceedingly rare.
No matter what kind of system you are running, serious non-recoverable
failures will happen, so I don't see this as an impediment to
running Zookeeper or other quorum systems in EC2.
That said, I haven't run enough EC2 instances for a long enough period
of time to observe any serious failures or recoveries. If anyone has
more detailed information, I would love to hear about it.