Hi,

I'm investigating a crash of a ZooKeeper 3.3.4 cluster. The cause appears to have been an issue in the networking layer: all ZK servers suddenly lost their connections to clients as well as to each other. Only a few seconds later, all ZooKeeper servers failed to load their database because of the following exception.
ERROR [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@224] Failed to increment parent cversion for: /a/b/c
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /a/b/c
    at DataTree.incrementCversion(DataTree.java:1218)
    at FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:222)
    at FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
    at ZKDatabase.loadDataBase(ZKDatabase.java:222)
    at QuorumPeer.getLastLoggedZxid(QuorumPeer.java:493)
    at FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:632)
    at FastLeaderElection.lookForLeader(FastLeaderElection.java:660)
    at QuorumPeer.run(QuorumPeer.java:622)
WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@497] Unable to load database

Note that the path "/a/b/c" was different on each server, so each server was trying to restore a different transaction.

The only way I was able to bring the cluster back online was to delete all the transaction logs on all servers and start from the latest snapshot.

I have all the logs and snapshots available for investigation. Are there any tools that would help with this? I'd like to find out how a network outage could leave the system in such an inconsistent and unstable state.

I noticed a few stability fixes in 3.3.5/3.3.6, so an upgrade is already scheduled.

Any help is appreciated.

-Gunnar

--
Gunnar Wagenknecht
[email protected]
http://wagenknecht.org/
