That is one of the great virtues of working with ZK... in the event of a
server failure, you get behavior as good as can be expected.

There are several failure scenarios:

a) a (small) fraction of the ZK servers fail or are cut off, but a quorum
persists

b) a (large) fraction of the ZK servers fail or are cut off and a quorum no
longer exists

c) the network connection to ZK from the machine changing disk status is
interrupted for a short time

d) the machine changing disk status goes down or is disconnected from ZK for
a long period of time.

Failure (a) is not a problem and is, indeed, a normal maintenance operation
when you are upgrading ZK.

Failure (b) is serious and will cause all updates to ZK to stop.  The state
will be preserved if at all possible, and when enough ZK machines reappear to
form a quorum, operations will proceed normally.  For example, a five-server
ensemble keeps its quorum of three with up to two servers down; losing a
third stops updates until one of them comes back.

Failure (c) is generally non-critical, but you should consider how short "a
short time" should be and set your ZK timeouts accordingly.  You have to
deal with this issue in any case to have a reliable system.
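As a rough sketch of what I mean (Java, using the standard ZooKeeper client;
the connect string, the 15 second session timeout, and the recovery actions
are illustrative assumptions, not recommendations):

import java.io.IOException;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionTimeoutExample {
    public static void main(String[] args) throws IOException, InterruptedException {
        final CountDownLatch connected = new CountDownLatch(1);

        // The session timeout (here 15 s) bounds how long a network
        // interruption can last before ZK declares the session dead and
        // deletes the client's ephemeral nodes.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                switch (event.getState()) {
                    case SyncConnected:
                        // Connected (or reconnected within the timeout):
                        // the session and its ephemeral nodes survived.
                        connected.countDown();
                        break;
                    case Disconnected:
                        // Transient outage: the client library keeps trying
                        // other servers; pause writes until reconnected.
                        break;
                    case Expired:
                        // The interruption exceeded the session timeout:
                        // ephemeral nodes are gone, so create a new ZooKeeper
                        // handle and re-establish state from scratch.
                        break;
                    default:
                        break;
                }
            }
        });

        connected.await();
        System.out.println("session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}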

Failure (d) is normally handled by using some kind of ephemeral file.  For
instance, you can have one ephemeral file for each machine with a disk.
Then you can have a master process that is notified when such a machine's
ephemeral file disappears.  This master process can do any cleanup
operations necessary.  It is normal to have several master processes of
which only one is active (use ZK for leader election to make this work).
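A rough sketch of the ephemeral-file part (Java; the /disks parent path, the
node-per-hostname naming, and the cleanup hook are assumptions made for
illustration, and the leader election for the master is left out):

import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DiskPresence {

    // Each machine with a disk registers an ephemeral file under /disks
    // (a persistent parent assumed to exist).  The node disappears by itself
    // when the machine dies or stays disconnected past the session timeout.
    static void register(ZooKeeper zk, String hostname)
            throws KeeperException, InterruptedException {
        zk.create("/disks/" + hostname, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
    }

    // The currently active master watches the children of /disks.  Watches
    // fire only once, so the callback re-registers the watch; the master then
    // diffs the new list against the previous one to find machines whose
    // ephemeral files vanished and runs whatever cleanup those need.
    static List<String> watchDisks(final ZooKeeper zk)
            throws KeeperException, InterruptedException {
        return zk.getChildren("/disks", new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                try {
                    watchDisks(zk);   // re-arm the watch; diff and clean up here
                } catch (Exception e) {
                    // handle connection loss / shutdown as appropriate
                }
            }
        });
    }
}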

On Wed, Mar 31, 2010 at 1:28 AM, zd.wbh <zd....@163.com> wrote:

>  It is under the assumption that the zookeeper requester is stable enough. What
> if a server restart occurs in the update sequence? No abort or proceed action
> can be done. I'm just curious how to handle this kind of dirty data.
>
