Re: entire cluster dies with EOFException

Aaron Zimmerman Mon, 07 Jul 2014 11:12:01 -0700

Util.readTxnBytes reads from the buffer and, if the length is 0, it returns
the zero-length array, seemingly indicating the end of the file.


Then this is detected in FileTxnLog.java:671:

                byte[] bytes = Util.readTxnBytes(ia);
                // Since we preallocate, we define EOF to be an
                if (bytes == null || bytes.length==0) {
                    throw new EOFException("Failed to read " + logFile);
                }


This exception is caught a few lines later, and the streams are closed, etc.

So this seems to be not really an error condition, but a signal that the
entire file has been read? Is this exception a red herring?
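
To make the mechanism concrete, here is a minimal sketch of the general
pattern, not the actual ZooKeeper code. It assumes a simplified log format
(a 4-byte length prefix followed by the payload) and a file padded with
zeros by preallocation. The class and method names are hypothetical. A
zero-length record is turned into an EOFException, which the read loop
treats as the normal end of the log rather than as a failure.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a preallocated transaction log reader.
class PreallocatedLogSketch {

    // Reads one length-prefixed record. In the zero-padded tail of the
    // file the length prefix is 0, so an empty array comes back.
    static byte[] readRecord(DataInputStream in) throws IOException {
        int len = in.readInt();
        if (len < 0) {
            throw new IOException("Negative record length: " + len);
        }
        byte[] bytes = new byte[len];
        in.readFully(bytes);
        return bytes;
    }

    // Reads records until the padding (or the physical end of the data) is hit.
    static List<byte[]> readAll(byte[] log) {
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(log))) {
            while (true) {
                byte[] bytes = readRecord(in);
                if (bytes.length == 0) {
                    // Empty record: we have reached the preallocated padding.
                    throw new EOFException("Reached zero padding");
                }
                records.add(bytes);
            }
        } catch (EOFException e) {
            // Expected: this is the end-of-log signal, not an error.
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

In this sketch the EOFException never escapes readAll; it only ends the
loop, which matches the reading that it is a signal rather than an error.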




On Mon, Jul 7, 2014 at 11:50 AM, Raúl Gutiérrez Segalés <[email protected]> wrote:

> On 7 July 2014 09:39, Aaron Zimmerman <[email protected]> wrote:
>
> > What I don't understand is how the entire cluster could die in such a
> > situation.  I was able to load zookeeper locally using the snapshot and
> > 10g log file without apparent issue.
>
>
> Sure, but it's syncing up with other learners that becomes challenging when
> having either big snapshots or too many txnlogs, right?
>
>
> > I can see how large amounts of data could
> > cause latency issues in syncing causing a single worker to die, but how
> > would that explain the node's inability to restart?  When the server
> > replays the log file, does it have to sync the transactions to other
> > nodes while it does so?
> >
>
> Given that your txn churn is so big, by the time it finishes reading
> from disk it'll need to catch up with the quorum. How many txns have
> happened by that point? By the way, we use this patch:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-1804
>
> to measure transaction rate, do you have any approximation of what your
> transaction rate might be?
>
>
> >
> > I can alter the settings as has been discussed, but I worry that I'm just
> > delaying the same thing from happening again, if I deploy another storm
> > topology or something.  How can I get the cluster in a state where I can
> > be confident that it won't crash in a similar way as load increases, or at
> > least set up some kind of monitoring that will let me know something is
> > unhealthy?
> >
>
> I think it depends on what your txn rate is, let's measure that first I
> guess.
>
>
> -rgs
>
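
On measuring the transaction rate: one rough approach, separate from the
patch linked above, is to sample the server's zxid twice and divide the
difference by the elapsed time. The sketch below assumes the two zxids
have already been obtained somehow (for example from the "Zxid:" line of
the srvr four-letter command). The class and method names are
hypothetical. It relies on the zxid layout of a 32-bit epoch in the high
bits and a 32-bit counter in the low bits, and gives no answer when the
epoch changed between samples, since the counter restarts on a new epoch.

```java
// Hypothetical helper: estimates txns/sec from two zxid samples.
class ZxidRateSketch {

    // High 32 bits of a zxid: the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits of a zxid: the per-epoch transaction counter.
    static long counter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    // Returns the approximate transactions per second between two samples,
    // or NaN if a leader election happened in between (epoch changed).
    static double txnsPerSecond(long earlierZxid, long laterZxid,
                                double elapsedSeconds) {
        if (elapsedSeconds <= 0) {
            throw new IllegalArgumentException("elapsedSeconds must be > 0");
        }
        if (epoch(earlierZxid) != epoch(laterZxid)) {
            return Double.NaN;
        }
        return (counter(laterZxid) - counter(earlierZxid)) / elapsedSeconds;
    }
}
```

Sampling every few seconds and alerting when the rate stays above what
the quorum can sync would be one possible monitoring signal, though the
right threshold depends on the deployment.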
