Oh, Storm, right. Yeah, I've seen this. The transaction rate is so high that the initial sync fails. Perhaps you could try bigger tickTime, initLimit and syncLimit params...
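[For reference, a minimal zoo.cfg sketch along those lines. The values are illustrative only, not tuned for this workload; initLimit and syncLimit are measured in ticks. snapCount and autopurge are not mentioned above but are the standard knobs for the large log.XXXX files discussed below.]

```
# zoo.cfg -- illustrative values only, tune for your workload
tickTime=2000        # ms per tick
initLimit=20         # ticks a follower may take for initial sync (here 40s)
syncLimit=10         # ticks a follower may lag the leader (here 20s)

# With a very high transaction rate, a lower snapCount rolls the txn log and
# snapshots more often; autopurge keeps old snapshots/logs from piling up.
snapCount=50000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
```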
-rgs

On 6 July 2014 04:48, Aaron Zimmerman <[email protected]> wrote:

> Raúl,
>
> zk_approximate_data_size 4899392
>
> That is about the size of the snapshots also.
>
> Benjamin,
>
> We are not running out of disk space. But the log.XXXX files are quite
> large, is this normal? In less than 3 hours, the log file since the last
> snapshot is 8.2G, and the older log files are as large as 12G.
>
> We are using Storm Trident, which uses zookeeper pretty heavily for
> tracking transactional state, but I'm not sure if that could account for
> this much storage. Is there an easy way to track which znodes are being
> updated most frequently?
>
> Thanks,
>
> Aaron
>
>
> On Sun, Jul 6, 2014 at 1:55 AM, Raúl Gutiérrez Segalés
> <[email protected]> wrote:
>
> > What's the total size of the data in your ZK cluster? i.e.:
> >
> > $ echo mntr | nc localhost 2181 | grep zk_approximate_data_size
> >
> > And/or the size of the snapshot?
> >
> >
> > -rgs
> >
> >
> > On 4 July 2014 06:29, Aaron Zimmerman <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > We have a 5 node zookeeper cluster that has been operating normally
> > > for several months. Starting a few days ago, the entire cluster
> > > crashes a few times per day, all nodes at the exact same time. We
> > > can't track down the exact issue, but deleting the snapshots and
> > > logs and restarting resolves it.
> > >
> > > We are running exhibitor to monitor the cluster.
> > > It appears that something bad gets into the logs, causing an
> > > EOFException, and this cascades through the entire cluster:
> > >
> > > 2014-07-04 12:55:26,328 [myid:1] - WARN
> > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > > following the leader
> > > java.io.EOFException
> > >     at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > >     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > >     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> > >     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> > >     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> > >     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> > >     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> > > 2014-07-04 12:55:26,328 [myid:1] - INFO
> > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> > > java.lang.Exception: shutdown Follower
> > >     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> > >     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> > >
> > > Then the server dies, exhibitor tries to restart each node, and they
> > > all get stuck trying to replay the bad transaction, logging things
> > > like:
> > >
> > > 2014-07-04 12:58:52,734 [myid:1] - INFO [main:FileSnap@83] - Reading
> > > snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> > > 2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] -
> > > Created new input stream /var/lib/zookeeper/version-2/log.300000021
> > > 2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] -
> > > Created new input archive /var/lib/zookeeper/version-2/log.300000021
> > > 2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] -
> > > EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
> > > 2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] -
> > > Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
> > > 2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] -
> > > Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
> > > 2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] -
> > > EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2
> > >
> > > And the cluster is dead. The only way we have found to recover is to
> > > delete all of the data and restart.
> > >
> > > Anyone seen this before? Any ideas how I can track down what is
> > > causing the EOFException, or insulate zookeeper from completely
> > > crashing?
> > >
> > > Thanks,
> > >
> > > Aaron Zimmerman
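[On the question above of which znodes are updated most frequently: one approach is to dump a transaction log with ZooKeeper's LogFormatter class and count paths. This is a sketch, not a tested recipe; the formatter's exact output varies by version, so the parsing below runs against a fabricated sample in its general shape rather than a real dump.]

```shell
# Real input would come from something like (classpath details vary by install):
#   java -cp zookeeper.jar:lib/* org.apache.zookeeper.server.LogFormatter \
#        /var/lib/zookeeper/version-2/log.300000021 > /tmp/txnlog.txt
# Here we fabricate a small sample in that general shape instead:
cat > /tmp/txnlog.txt <<'EOF'
7/4/14 12:01:03 session 0x146f0 cxid 0x1 zxid 0x300000022 setData '/transactional/topology/coordinator,#00,v{s{31,s{'world,'anyone}}}
7/4/14 12:01:04 session 0x146f0 cxid 0x2 zxid 0x300000023 setData '/transactional/topology/coordinator,#00,v{s{31,s{'world,'anyone}}}
7/4/14 12:01:05 session 0x146f1 cxid 0x3 zxid 0x300000024 create '/other/znode,#00,v{s{31,s{'world,'anyone}}}
EOF

# The znode path is the first single-quoted token on each line, up to the
# first comma; extract it and rank paths by how often they appear.
awk -F"'" '{ split($2, a, ","); if (a[1] != "") print a[1] }' /tmp/txnlog.txt \
  | sort | uniq -c | sort -rn
```

On the sample above, this ranks /transactional/topology/coordinator first with two updates, which is the sort of hot-znode signal that would point at Storm Trident's transactional state.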
