You can try running them through org.apache.zookeeper.server.LogFormatter and see what comes out. That's where I would start.
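
If you have a lot of log files, something along these lines might save some typing. It is only a rough sketch: the jar names in the classpath and the directory you pass in are assumptions about your install, so adjust them. The only ZooKeeper piece it relies on is that LogFormatter takes the log file as its single argument.

```python
import subprocess
from pathlib import Path

# Assumed jar locations for a 3.3.4 install; change to match yours.
ZK_CLASSPATH = "zookeeper-3.3.4.jar:lib/log4j-1.2.15.jar"


def formatter_command(log_file, classpath=ZK_CLASSPATH):
    """Build the java command line that dumps one transaction log."""
    return [
        "java", "-cp", classpath,
        "org.apache.zookeeper.server.LogFormatter",
        str(log_file),
    ]


def dump_logs(log_dir, out_dir, classpath=ZK_CLASSPATH):
    """Run LogFormatter on each log.* file in log_dir, in file-name order.

    The text output for each log goes to <out_dir>/<log name>.txt.
    Returns the list of output files that were written.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for log_file in sorted(Path(log_dir).glob("log.*")):
        target = out_path / (log_file.name + ".txt")
        with open(target, "w") as fh:
            # check=False: a truncated or corrupt log should not stop the loop.
            subprocess.run(formatter_command(log_file, classpath),
                           stdout=fh, check=False)
        written.append(target)
    return written
```

Calling dump_logs() with the directory holding one server's log.* files and an empty output directory gives you one text dump per log, which you can then search for the path named in that server's exception.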

C

On Wed, Sep 5, 2012 at 3:43 AM, Gunnar Wagenknecht <[email protected]> wrote:

> Hi,
>
> I'm investigating a crash of a ZooKeeper 3.3.4 cluster. It seems that
> the cause of the crash was an issue in the networking layer. All the ZK
> server suddenly lost connections to clients as well as all between
> themselves. Only a few seconds later, all ZooKeeper servers had issues
> loading their database because of the following exception.
>
> ERROR [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@224]
> Failed to increment parent cversion for: /a/b/c
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /a/b/c
> at DataTree.incrementCversion(DataTree.java:1218)
> at FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:222)
> at FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
> at ZKDatabase.loadDataBase(ZKDatabase.java:222)
> at QuorumPeer.getLastLoggedZxid(QuorumPeer.java:493)
> at FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:632)
> at FastLeaderElection.lookForLeader(FastLeaderElection.java:660)
> at QuorumPeer.run(QuorumPeer.java:622)
>
> WARN [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@497]
> Unable to load database
>
> Note that the path "/a/b/c" was different on all servers. Thus, each
> server tried to restore a different transaction.
>
> The only way I was able to bring the cluster back online was to delete
> all the transaction logs on all servers and start with the latest snapshot.
>
> I have all the logs and snapshots available for investigation. Are there
> any tools to help an investigation? I'd like to find out how such a
> network outage could possibly cause such an inconsistent/instable state
> in the system. I noticed a few stability fixes in 3.3.5/3.3.6. Thus, an
> upgrade is already scheduled.
>
> Any help is appreciated.
>
> -Gunnar
>
> --
> Gunnar Wagenknecht
> [email protected]
> http://wagenknecht.org/
