On 6 July 2014 14:26, Flavio Junqueira <[email protected]> wrote:
> But what is it that was causing problems in your scenario, Raul? Is it
> reading the log? In any case, it sounds like initLimit is the parameter
> you want to change, no?
>

Yeah, I think so. It was just that it took too long to walk through all the
txns (too many of them). So finding the sweet spot of snapshots vs
transactions is a bit tricky in this case I think.

-rgs

>
> -Flavio
>
> On 06 Jul 2014, at 19:09, Raúl Gutiérrez Segalés <[email protected]>
> wrote:
>
> > Oh, storm right. Yeah I've seen this. The transaction rate is so huge
> > that the initial sync fails... perhaps you could try bigger tickTime,
> > initLimit and syncLimit params...
> >
> > -rgs
> >
> > On 6 July 2014 04:48, Aaron Zimmerman <[email protected]> wrote:
> >
> >> Raúl,
> >>
> >> zk_approximate_data_size 4899392
> >>
> >> That is about the size of the snapshots also.
> >>
> >> Benjamin,
> >>
> >> We are not running out of disk space.
> >> But the log.XXXX files are quite large, is this normal? In less than 3
> >> hours, the log file since the last snapshot is 8.2G, and the older log
> >> files are as large as 12G.
> >>
> >> We are using Storm Trident, which uses zookeeper pretty heavily for
> >> tracking transactional state, but I'm not sure if that could account
> >> for this much storage. Is there an easy way to track which znodes are
> >> being updated most frequently?
> >>
> >> Thanks,
> >>
> >> Aaron
> >>
> >> On Sun, Jul 6, 2014 at 1:55 AM, Raúl Gutiérrez Segalés
> >> <[email protected]> wrote:
> >>
> >>> What's the total size of the data in your ZK cluster? i.e.:
> >>>
> >>> $ echo mntr | nc localhost 2181 | grep zk_approximate_data_size
> >>>
> >>> And/or the size of the snapshot?
> >>>
> >>> -rgs
> >>>
> >>> On 4 July 2014 06:29, Aaron Zimmerman <[email protected]>
> >>> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> We have a 5-node zookeeper cluster that has been operating normally
> >>>> for several months. Starting a few days ago, the entire cluster
> >>>> crashes a few times per day, all nodes at the exact same time. We
> >>>> can't track down the exact issue, but deleting the snapshots and
> >>>> logs and restarting resolves it.
> >>>>
> >>>> We are running Exhibitor to monitor the cluster.
> >>>>
> >>>> It appears that something bad gets into the logs, causing an
> >>>> EOFException, and this cascades through the entire cluster:
> >>>>
> >>>> 2014-07-04 12:55:26,328 [myid:1] - WARN
> >>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception
> >>>> when following the leader
> >>>> java.io.EOFException
> >>>>     at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >>>>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>>>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>>>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> >>>>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> >>>>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> >>>> 2014-07-04 12:55:26,328 [myid:1] - INFO
> >>>> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown
> >>>> called
> >>>> java.lang.Exception: shutdown Follower
> >>>>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> >>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> >>>>
> >>>> Then the server dies, Exhibitor tries to restart each node, and they
> >>>> all get stuck trying to replay the bad transaction, logging things
> >>>> like:
> >>>>
> >>>> 2014-07-04 12:58:52,734 [myid:1] - INFO [main:FileSnap@83] - Reading
> >>>> snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> >>>> 2014-07-04 12:58:52,896 [myid:1] - DEBUG
> >>>> [main:FileTxnLog$FileTxnIterator@575] - Created new input stream
> >>>> /var/lib/zookeeper/version-2/log.300000021
> >>>> 2014-07-04 12:58:52,915 [myid:1] - DEBUG
> >>>> [main:FileTxnLog$FileTxnIterator@578] - Created new input archive
> >>>> /var/lib/zookeeper/version-2/log.300000021
> >>>> 2014-07-04 12:59:25,870 [myid:1] - DEBUG
> >>>> [main:FileTxnLog$FileTxnIterator@618] - EOF excepton
> >>>> java.io.EOFException: Failed to read
> >>>> /var/lib/zookeeper/version-2/log.300000021
> >>>> 2014-07-04 12:59:25,871 [myid:1] - DEBUG
> >>>> [main:FileTxnLog$FileTxnIterator@575] - Created new input stream
> >>>> /var/lib/zookeeper/version-2/log.300011fc2
> >>>> 2014-07-04 12:59:25,872 [myid:1] - DEBUG
> >>>> [main:FileTxnLog$FileTxnIterator@578] - Created new input archive
> >>>> /var/lib/zookeeper/version-2/log.300011fc2
> >>>> 2014-07-04 12:59:48,722 [myid:1] - DEBUG
> >>>> [main:FileTxnLog$FileTxnIterator@618] - EOF excepton
> >>>> java.io.EOFException: Failed to read
> >>>> /var/lib/zookeeper/version-2/log.300011fc2
> >>>>
> >>>> And the cluster is dead. The only way we have found to recover is to
> >>>> delete all of the data and restart.
> >>>>
> >>>> Anyone seen this before? Any ideas how I can track down what is
> >>>> causing the EOFException, or insulate zookeeper from completely
> >>>> crashing?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Aaron Zimmerman
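
For context on the tuning discussed above: tickTime, initLimit and syncLimit
all live in zoo.cfg, and the two limits are counted in ticks, so a follower
that has to replay a very large transaction log during its initial sync needs
a larger tickTime, a larger initLimit, or both. The snapshot-vs-transaction
trade-off comes down to how many transactions are written between snapshots,
which in the 3.4 line is typically set through the zookeeper.snapCount JVM
system property rather than in zoo.cfg. The values below are only an
illustrative sketch, not recommendations made in the thread:

    # zoo.cfg (illustrative values only; zoo.cfg does not support inline
    # comments, so they sit on their own lines)
    # one tick = 2000 ms
    tickTime=2000
    # ticks a follower may take to connect to and sync with the leader
    # (30 * 2000 ms = 60 s)
    initLimit=30
    # ticks a follower may lag behind the leader (20 s)
    syncLimit=10
    # stop old snapshots and transaction logs from piling up (3.4+):
    # purge every 12 hours, keeping the 5 most recent snapshots
    autopurge.snapRetainCount=5
    autopurge.purgeInterval=12

    # snapCount is a JVM flag rather than a zoo.cfg key: snapshot every
    # N transactions instead of the default 100000, e.g. via
    # JVMFLAGS="-Dzookeeper.snapCount=50000"

Lowering snapCount shortens the log that has to be replayed after a crash or
during an initial sync, at the cost of more frequent snapshot I/O.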

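Two of the open questions in the thread (what sits at the tail of the
corrupted log file, and which znodes Storm Trident is updating most often)
can be looked at with the LogFormatter class that ships with ZooKeeper. It
prints each transaction in a log file with its timestamp, session, zxid,
operation type and payload (including the znode path), and it stops and
reports how many transactions it managed to read once it hits an incomplete
record. The exact jar names and layout vary by installation, so treat the
classpath and the path-extraction pipeline below as a sketch:

    # dump a transaction log in readable form (run from the ZooKeeper
    # install directory; adjust the jar name to your version)
    java -cp "zookeeper-3.4.6.jar:lib/*" \
        org.apache.zookeeper.server.LogFormatter \
        /var/lib/zookeeper/version-2/log.300000021

    # rough count of writes per znode path; the grep pattern depends on
    # LogFormatter's output format, so adjust as needed
    java -cp "zookeeper-3.4.6.jar:lib/*" \
        org.apache.zookeeper.server.LogFormatter \
        /var/lib/zookeeper/version-2/log.300000021 \
        | grep -o "'/[^,]*" | sort | uniq -c | sort -rn | head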