That is one of the great virtues of working with ZK... in the event of a server failure, you get behavior as good as can be expected.
There are several failure scenarios:

a) a (small) fraction of the ZK servers fail or are cut off, but a quorum persists

b) a (large) fraction of the ZK servers fail or are cut off and a quorum no longer exists

c) the network connection to ZK from the machine changing disk status is interrupted for a short time

d) the machine changing disk status goes down or is disconnected from ZK for a long period of time

Failure (a) is not a problem and is, indeed, a normal maintenance operation when you are upgrading ZK.

Failure (b) is serious and will cause all updates to ZK to stop. The state will be preserved if at all possible, and when enough ZK machines reappear to form a quorum, operations will proceed normally.

Failure (c) is generally non-critical, but you should consider how short "a short time" should be and set your ZK timeouts accordingly. You have to deal with this issue in any case to have a reliable system.

Failure (d) is normally handled by using some kind of ephemeral file. For instance, you can have one ephemeral file for each machine with a disk, and a master process that is notified when such a machine's ephemeral file disappears. This master process can then do any cleanup operations necessary. It is normal to have several master processes, of which only one is active (use ZK for leader election to make this work).

On Wed, Mar 31, 2010 at 1:28 AM, zd.wbh <zd....@163.com> wrote:
> It is under the assumption that the zookeeper requester is stable enough. What
> if a server restart occurs in the update sequence, so that no abort or proceed
> action can be done? I'm just curious how to handle this kind of dirty data.
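To make the failure (d) pattern concrete, here is a minimal in-process sketch of it. This is only a simulation of ZooKeeper's behavior (the mock class, node paths, and session ids below are all made up for illustration); in a real deployment the machine would create an actual ephemeral znode through the ZK client API, and the elected master would set a watch on the parent directory.

```python
# Local simulation of the ephemeral-file pattern: each machine with a disk
# owns one ephemeral node tied to its session; when the session expires
# (machine down, or disconnected longer than the ZK timeout), the node
# vanishes and the watching master runs its cleanup. All names here are
# illustrative -- this does not talk to a real ZooKeeper ensemble.

class MockZooKeeper:
    """Tracks ephemeral nodes per session; expiring a session drops its nodes."""

    def __init__(self):
        self.nodes = {}      # path -> owning session id
        self.watchers = []   # callbacks fired when a node disappears

    def create_ephemeral(self, path, session_id):
        self.nodes[path] = session_id

    def watch_deletions(self, callback):
        self.watchers.append(callback)

    def expire_session(self, session_id):
        # Session timed out: all ephemeral nodes it owned disappear,
        # and every watcher (the active master) is notified per node.
        gone = [p for p, s in self.nodes.items() if s == session_id]
        for path in gone:
            del self.nodes[path]
            for cb in self.watchers:
                cb(path)

cleaned_up = []

zk = MockZooKeeper()
# The active master registers its cleanup hook (here: just record the path).
zk.watch_deletions(lambda path: cleaned_up.append(path))

# Each machine with a disk registers one ephemeral file under /machines.
zk.create_ephemeral("/machines/host-a", session_id=1)
zk.create_ephemeral("/machines/host-b", session_id=2)

# host-a goes down for a long period; its session expires and the master
# is told which machine to clean up after. host-b is unaffected.
zk.expire_session(1)

print(cleaned_up)  # -> ['/machines/host-a']
```

Note that only one master should actually perform the cleanup; in the real system you would use ZK leader election so the standby masters stay idle until the active one fails.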