Closing the loop on this: it appears that upping initLimit did resolve the issue. Thanks all for the help.
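
For anyone who finds this in the archives, the change was along these lines in zoo.cfg. The values below are illustrative only, not a recommendation; what is sufficient depends on snapshot size and disk/network speed:

    # zoo.cfg (illustrative values)
    tickTime=2000
    # followers get initLimit * tickTime (here 120s) to connect to the
    # leader and finish syncing before the leader gives up on them
    initLimit=60
    # syncLimit bounds how far a follower may lag in steady state
    syncLimit=10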
Thanks,
Aaron Zimmerman

On Tue, Jul 8, 2014 at 4:40 PM, Flavio Junqueira <[email protected]> wrote:

> Agreed, but we need that check because we expect bytes for the checksum
> computation right underneath. The bit that's odd is that we make the same
> check again below:
>
>     try {
>         long crcValue = ia.readLong("crcvalue");
>         byte[] bytes = Util.readTxnBytes(ia);
>         // Since we preallocate, we define EOF to be an
>         // empty transaction
>         if (bytes == null || bytes.length == 0) {
>             throw new EOFException("Failed to read " + logFile);
>         }
>         // EOF or corrupted record
>         // validate CRC
>         Checksum crc = makeChecksumAlgorithm();
>         crc.update(bytes, 0, bytes.length);
>         if (crcValue != crc.getValue())
>             throw new IOException(CRC_ERROR);
>         if (bytes == null || bytes.length == 0)
>             return false;
>         hdr = new TxnHeader();
>         record = SerializeUtils.deserializeTxn(bytes, hdr);
>     } catch (EOFException e) {
>
> I'm moving this discussion to the jira, btw.
>
> -Flavio
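
Rereading that block while closing the loop: the second null/empty check
really does look unreachable, since the first one already throws for that
case. A sketch only, to make the redundancy visible (not a tested patch;
same identifiers as the FileTxnLog fragment above):

    try {
        long crcValue = ia.readLong("crcvalue");
        byte[] bytes = Util.readTxnBytes(ia);
        // Zero-length bytes mean we ran into the preallocated
        // (zero-filled) tail of the log: treat that as EOF once, up front.
        if (bytes == null || bytes.length == 0) {
            throw new EOFException("Failed to read " + logFile);
        }
        // Validate the CRC before trusting the record.
        Checksum crc = makeChecksumAlgorithm();
        crc.update(bytes, 0, bytes.length);
        if (crcValue != crc.getValue())
            throw new IOException(CRC_ERROR);
        // The second null/empty check is dropped: the throw above already
        // covers it, so the "return false" branch there was dead code.
        hdr = new TxnHeader();
        record = SerializeUtils.deserializeTxn(bytes, hdr);
    } catch (EOFException e) {

Whether EOFException should remain the EOF signal at all is probably best
sorted out on the jira.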
> On 07 Jul 2014, at 22:03, Aaron Zimmerman <[email protected]> wrote:
>
> > Flavio,
> >
> > Yes, that is the initial error, and then the nodes in the cluster are
> > restarted but fail to restart with:
> >
> > 2014-07-04 12:58:52,734 [myid:1] - INFO [main:FileSnap@83] - Reading
> > snapshot /var/lib/zookeeper/version-2/snapshot.300011fc0
> > 2014-07-04 12:58:52,896 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@575] - Created new input stream
> > /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:58:52,915 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@578] - Created new input archive
> > /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:59:25,870 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException:
> > Failed to read /var/lib/zookeeper/version-2/log.300000021
> > 2014-07-04 12:59:25,871 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@575] - Created new input stream
> > /var/lib/zookeeper/version-2/log.300011fc2
> > 2014-07-04 12:59:25,872 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@578] - Created new input archive
> > /var/lib/zookeeper/version-2/log.300011fc2
> > 2014-07-04 12:59:48,722 [myid:1] - DEBUG
> > [main:FileTxnLog$FileTxnIterator@618] - EOF excepton java.io.EOFException:
> > Failed to read /var/lib/zookeeper/version-2/log.300011fc2
> >
> > Thanks,
> >
> > AZ
> >
> > On Mon, Jul 7, 2014 at 3:33 PM, Flavio Junqueira <[email protected]> wrote:
> >
> >> I'm a bit confused; the stack trace you reported was this one:
> >>
> >> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> >> following the leader
> >> java.io.EOFException
> >>     at java.io.DataInputStream.readInt(DataInputStream.java:375)
> >>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
> >>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
> >>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
> >>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
> >>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
> >>
> >> That's in a different part of the code.
> >>
> >> -Flavio
> >>
> >> On 07 Jul 2014, at 18:50, Aaron Zimmerman <[email protected]> wrote:
> >>
> >>> Util.readTxnBytes reads from the buffer and, if the length is 0, returns
> >>> the zero-length array, seemingly indicating the end of the file.
> >>>
> >>> Then this is detected in FileTxnLog.java:671:
> >>>
> >>>     byte[] bytes = Util.readTxnBytes(ia);
> >>>     // Since we preallocate, we define EOF to be an
> >>>     // empty transaction
> >>>     if (bytes == null || bytes.length == 0) {
> >>>         throw new EOFException("Failed to read " + logFile);
> >>>     }
> >>>
> >>> This exception is caught a few lines later, and the streams closed, etc.
> >>>
> >>> So this seems to be not really an error condition, but a signal that the
> >>> entire file has been read? Is this exception a red herring?
> >>>
> >>> On Mon, Jul 7, 2014 at 11:50 AM, Raúl Gutiérrez Segalés <[email protected]> wrote:
> >>>
> >>>> On 7 July 2014 09:39, Aaron Zimmerman <[email protected]> wrote:
> >>>>
> >>>>> What I don't understand is how the entire cluster could die in such a
> >>>>> situation. I was able to load zookeeper locally using the snapshot and
> >>>>> 10g log file without apparent issue.
> >>>>
> >>>> Sure, but it's syncing up with other learners that becomes challenging
> >>>> with either big snapshots or too many txnlogs, right?
> >>>>
> >>>>> I can see how large amounts of data could cause latency issues in
> >>>>> syncing, causing a single worker to die, but how would that explain the
> >>>>> node's inability to restart? When the server replays the log file, does
> >>>>> it have to sync the transactions to other nodes while it does so?
> >>>>
> >>>> Given that your txn churn is so big, by the time it finishes reading
> >>>> from disk it'll need to catch up with the quorum... how many txns have
> >>>> happened by that point? By the way, we use this patch:
> >>>>
> >>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1804
> >>>>
> >>>> to measure transaction rate; do you have any approximation of what your
> >>>> transaction rate might be?
> >>>>
> >>>>> I can alter the settings as has been discussed, but I worry that I'm
> >>>>> just delaying the same thing from happening again, if I deploy another
> >>>>> storm topology or something. How can I get the cluster in a state where
> >>>>> I can be confident that it won't crash in a similar way as load
> >>>>> increases, or at least set up some kind of monitoring that will let me
> >>>>> know something is unhealthy?
> >>>>
> >>>> I think it depends on what your txn rate is; let's measure that first, I
> >>>> guess.
> >>>>
> >>>> -rgs
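
For the archives, the preallocation convention that makes those
EOFExceptions benign is easy to see in isolation. A toy sketch below, not
ZooKeeper code: the record layout is simplified to a bare length prefix,
but the idea is the same, a read that lands in the zero-filled tail yields
a zero length, which the reader treats as logical EOF rather than an error.

    import java.io.*;

    public class PreallocDemo {
        public static void main(String[] args) throws IOException {
            File f = File.createTempFile("toy-log", ".bin");
            // Write one length-prefixed record, then leave a zero-filled
            // tail, mimicking a preallocated txn log.
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                raf.setLength(1024);              // "preallocate" with zeros
                byte[] payload = "hello".getBytes("UTF-8");
                raf.writeInt(payload.length);     // length prefix
                raf.write(payload);               // record body
            }
            try (DataInputStream in = new DataInputStream(
                    new BufferedInputStream(new FileInputStream(f)))) {
                while (true) {
                    int len = in.readInt();
                    // A zero length means we ran into the preallocated
                    // tail: that is the EOF signal, not a corrupt record.
                    if (len == 0) {
                        System.out.println("hit zero-filled tail: logical EOF");
                        break;
                    }
                    byte[] body = new byte[len];
                    in.readFully(body);
                    System.out.println("record: " + new String(body, "UTF-8"));
                }
            }
        }
    }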

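And on Raúl's transaction-rate question, for anyone without the
ZOOKEEPER-1804 patch applied: a crude approximation is to sample the zxid
reported by the 'srvr' four-letter command and diff the low 32 bits (the
per-epoch txn counter) across an interval. A rough sketch, assuming a
server reachable on localhost:2181 and no leader election between the two
samples:

    import java.io.*;
    import java.net.Socket;

    public class TxnRateProbe {
        // Send a four-letter command and return the raw response.
        static String fourLetter(String host, int port, String cmd)
                throws IOException {
            try (Socket s = new Socket(host, port)) {
                s.getOutputStream().write(cmd.getBytes("US-ASCII"));
                s.getOutputStream().flush();
                BufferedReader r = new BufferedReader(
                        new InputStreamReader(s.getInputStream(), "US-ASCII"));
                StringBuilder sb = new StringBuilder();
                String line;
                while ((line = r.readLine()) != null) {
                    sb.append(line).append('\n');
                }
                return sb.toString();
            }
        }

        // The low 32 bits of the zxid count txns within the current epoch.
        static long txnCounter(String srvrOutput) {
            for (String line : srvrOutput.split("\n")) {
                if (line.startsWith("Zxid:")) {
                    long zxid = Long.decode(line.split("\\s+")[1]);
                    return zxid & 0xffffffffL;
                }
            }
            throw new IllegalStateException("no Zxid line in srvr output");
        }

        public static void main(String[] args) throws Exception {
            long a = txnCounter(fourLetter("localhost", 2181, "srvr"));
            Thread.sleep(10_000);  // sample window; longer smooths bursts
            long b = txnCounter(fourLetter("localhost", 2181, "srvr"));
            System.out.printf("approx rate: %.1f txn/s%n", (b - a) / 10.0);
        }
    }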