I was just asked a very cogent question of the form "how do you know" and would like somebody who knows better than I do to confirm or deny my response. The only part that I am absolutely sure of is the part at the end where I say "No doubt I have omitted something". With an edit from Ben, this probably should become a wiki page.
Here is the conversation. Please mark my errors or elisions if you can.

> You have to educate me in how ZK does data-integrity checking to avoid
> propagating accidental data-corruption (eg, from an ext3 bug, faulty drive,
> etc). We might have to augment ZK to add that ability if it doesn't already.

There are several mechanisms: at the application API layer, in ZK internals, and in the snapshot and transaction log formats.

At the application layer, all updates are atomic and replace the entire contents of a node. If you provide a version number with the update instead of -1, your update will succeed only if the current version matches that version. All updates are strictly serialized, so there are no race conditions on updates. This makes lots of things simpler inside of ZK.

All updates go through the ZK master and are committed only when replicated to a quorum of the cluster. The quorum is ceil((n+1)/2), where n is the configured number of servers. "Committed" means that the update has been flushed to the transaction log on disk. The replication of updates is from master memory to slave memory, rather than from master memory to disk and then to the slave.

All logs and snapshots carry application-level CRCs and don't depend on disk ECC for correctness. At cluster start, the logs are examined to determine the latest transaction that was committed correctly. Because of the update semantics, at least a quorum must have the latest update, and the strong CRCs prevent a partially written transaction from being read as acceptable. I don't know exactly how the last update id is stored.

Reads can give stale data for short periods of time, but will always give a coherently updated picture of what the master knew at some point in the past. If a client's connection is not lost, the client's view of the last update id is monotonically increasing.
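To make the version-checked update semantics concrete, here is a minimal pure-Python sketch of how a conditional set behaves. This is a hypothetical in-memory model, not the real ZooKeeper API: the class name `ZNode` and method `set_data` are my own, chosen to mirror the behavior described above (whole-contents replacement, version check, and -1 meaning "unconditional").

```python
class ZNode:
    """Hypothetical in-memory model of a ZooKeeper node's update semantics."""

    def __init__(self, data: bytes = b""):
        self.data = data
        self.version = 0

    def set_data(self, data: bytes, expected_version: int) -> int:
        # -1 mirrors the ZooKeeper convention for an unconditional update.
        if expected_version != -1 and expected_version != self.version:
            raise ValueError(
                f"version mismatch: have {self.version}, "
                f"caller expected {expected_version}"
            )
        self.data = data       # the entire contents are replaced atomically
        self.version += 1
        return self.version


node = ZNode(b"v0")
node.set_data(b"v1", expected_version=0)       # succeeds; version is now 1
try:
    node.set_data(b"oops", expected_version=0)  # stale version, rejected
except ValueError as e:
    print(e)
node.set_data(b"v2", expected_version=-1)       # unconditional, succeeds
```

Because the real cluster strictly serializes all updates through the master, this compare-and-set style check gives clients a race-free read-modify-write primitive.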
If a client loses a connection and reconnects, it is conceivable that it will see an earlier version of the universe after reconnecting, but this is exceedingly unlikely, because the time required to reconnect is typically longer than how far out of date any ZK cluster member typically is. You can force this to happen: connect to a ZK cluster member, partition the cluster so that your connected server is in the majority, update, and then connect to a cluster member that is separated from the master of the ZK cluster.

You can always do a sync to force the server you are using to catch up to the master (at least for the moment). The use of sync forces a monotonic view of time regardless of any connection/disconnection/reconnection scenario.

Since all updates must go through the cluster master, updates cannot happen in split-brain scenarios, and the strong serialization guarantees for all updates are maintained.

The only data-loss scenarios I have heard of in about 3 years of watching all ZK mailing list traffic and all bug reports have had to do with the theoretical possibility of problems due to disk controllers that lie about when data is persisted. I have seen cluster failures where memory was exhausted, but I can't remember any that caused loss of data. There was one bug where a cluster member somehow remembered a transaction id that was one past the last transaction committed to the logs. That was a very strange coincidence of very aggressive operator error and a bug, which has since been fixed. The cluster refused to restart in this case, but the fix was simply to delete the log on the confused machine and restart the cluster. No data was lost.

I have, no doubt, omitted something.
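The split-brain claim above rests on simple arithmetic: a quorum of ceil((n+1)/2) is a strict majority, so any two quorums must share at least one server. A sketch, assuming nothing beyond the quorum formula stated earlier:

```python
import math


def quorum(n: int) -> int:
    """Quorum size for n configured servers: ceil((n+1)/2), a strict majority."""
    return math.ceil((n + 1) / 2)


for n in range(1, 8):
    q = quorum(n)
    # Two disjoint groups of size q cannot both fit inside n servers
    # (2*q > n), so any two quorums overlap in at least one member.
    # That overlapping member carries the committed state, which is why
    # a partitioned minority can neither commit updates nor elect a master.
    assert 2 * q > n
    print(f"n={n} quorum={q}")
```

This also shows why the cluster can always find the latest committed transaction at restart: the quorum that acknowledged it intersects every possible future quorum.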