Oh, Storm, right. Yeah, I've seen this. The transaction rate is so high that the initial sync fails. Perhaps you could try bigger tickTime, initLimit and syncLimit params...
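[For reference, a minimal zoo.cfg sketch along those lines. The values are illustrative only, not tuned for this workload; initLimit and syncLimit are measured in ticks. snapCount and autopurge are not mentioned above but are the standard knobs for the large log.XXXX files discussed below.]

```
# zoo.cfg -- illustrative values only, tune for your workload
tickTime=2000        # ms per tick
initLimit=20         # ticks a follower may take for initial sync (here 40s)
syncLimit=10         # ticks a follower may lag the leader (here 20s)

# With a very high transaction rate, a lower snapCount rolls the txn log and
# snapshots more often; autopurge keeps old snapshots/logs from piling up.
snapCount=50000
autopurge.snapRetainCount=3
autopurge.purgeInterval=1
```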
-rgs

On 6 July 2014 04:48, Aaron Zimmerman <[email protected]> wrote:

> Raúl,
>
> zk_approximate_data_size 4899392
>
> That is about the size of the snapshots also.
>
> Benjamin,
>
> We are not running out of disk space. But the log.XXXX files are quite
> large, is this normal? In less than 3 hours, the log file since the last
> snapshot is 8.2G, and the older log files are as large as 12G.
>
> We are using Storm Trident, which uses zookeeper pretty heavily for
> tracking transactional state, but I'm not sure if that could account for
> this much storage. Is there an easy way to track which znodes are being
> updated most frequently?
>
> Thanks,
>
> Aaron
>
>
> On Sun, Jul 6, 2014 at 1:55 AM, Raúl Gutiérrez Segalés
> <[email protected]> wrote:
>
> > What's the total size of the data in your ZK cluster? i.e.:
> >
> > $ echo mntr | nc localhost 2181 | grep zk_approximate_data_size
> >
> > And/or the size of the snapshot?
> >
> >
> > -rgs
> >
> >
> > On 4 July 2014 06:29, Aaron Zimmerman <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > We have a 5 node zookeeper cluster that has been operating normally
> > > for several months. Starting a few days ago, the entire cluster
> > > crashes a few times per day, all nodes at the exact same time. We
> > > can't track down the exact issue, but deleting the snapshots and
> > > logs and restarting resolves it.
> > >
> > > We are running exhibitor to monitor the cluster.
> > > It appears that something bad gets into the logs, causing an
> > > EOFException, and this cascades through the entire cluster:
> > >
> > > 2014-07-04 12:55:26,328 [myid:1] - WARN
> > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> > > following the leader
> > > java.io.EOFException
> > >     at java.io.DataInputStream.readInt(DataInputStream.java:375)
> > >     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> > >     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> > >     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> > >     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> > >     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> > >     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> > > 2014-07-04 12:55:26,328 [myid:1] - INFO
> > > [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> > > java.lang.Exception: shutdown Follower
> > >     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> > >     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:744)
> > >
> > > Then the server dies, exhibitor tries to restart each node, and they
> > > all get stuck trying to replay the bad transaction, logging things
> > > like:
> > >
> > > 2014-07-04 12:58:52,734 [myid:1] - INFO [main:FileSnap@83] - Reading
> > > snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> > > 2014-07-04 12:58:52,896 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] -
> > > Created new input stream /var/lib/zookeeper/version-2/log.300000021
> > > 2014-07-04 12:58:52,915 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] -
> > > Created new input archive /var/lib/zookeeper/version-2/log.300000021
> > > 2014-07-04 12:59:25,870 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] -
> > > EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300000021
> > > 2014-07-04 12:59:25,871 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@575] -
> > > Created new input stream /var/lib/zookeeper/version-2/log.300011fc2
> > > 2014-07-04 12:59:25,872 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@578] -
> > > Created new input archive /var/lib/zookeeper/version-2/log.300011fc2
> > > 2014-07-04 12:59:48,722 [myid:1] - DEBUG [main:FileTxnLog$FileTxnIterator@618] -
> > > EOF excepton java.io.EOFException: Failed to read /var/lib/zookeeper/version-2/log.300011fc2
> > >
> > > And the cluster is dead. The only way we have found to recover is to
> > > delete all of the data and restart.
> > >
> > > Anyone seen this before? Any ideas how I can track down what is
> > > causing the EOFException, or insulate zookeeper from completely
> > > crashing?
> > >
> > > Thanks,
> > >
> > > Aaron Zimmerman
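[On the question above of which znodes are updated most frequently: one approach is to dump a transaction log with ZooKeeper's LogFormatter class and count paths. This is a sketch, not a tested recipe; the formatter's exact output varies by version, so the parsing below runs against a fabricated sample in its general shape rather than a real dump.]

```shell
# Real input would come from something like (classpath details vary by install):
#   java -cp zookeeper.jar:lib/* org.apache.zookeeper.server.LogFormatter \
#        /var/lib/zookeeper/version-2/log.300000021 > /tmp/txnlog.txt
# Here we fabricate a small sample in that general shape instead:
cat > /tmp/txnlog.txt <<'EOF'
7/4/14 12:01:03 session 0x146f0 cxid 0x1 zxid 0x300000022 setData '/transactional/topology/coordinator,#00,v{s{31,s{'world,'anyone}}}
7/4/14 12:01:04 session 0x146f0 cxid 0x2 zxid 0x300000023 setData '/transactional/topology/coordinator,#00,v{s{31,s{'world,'anyone}}}
7/4/14 12:01:05 session 0x146f1 cxid 0x3 zxid 0x300000024 create '/other/znode,#00,v{s{31,s{'world,'anyone}}}
EOF

# The znode path is the first single-quoted token on each line, up to the
# first comma; extract it and rank paths by how often they appear.
awk -F"'" '{ split($2, a, ","); if (a[1] != "") print a[1] }' /tmp/txnlog.txt \
  | sort | uniq -c | sort -rn
```

On the sample above, this ranks /transactional/topology/coordinator first with two updates, which is the sort of hot-znode signal that would point at Storm Trident's transactional state.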
