Re: entire cluster dies with EOFException

Flavio Junqueira Mon, 07 Jul 2014 13:34:33 -0700

I'm a bit confused. The stack trace you reported was this one:

[QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
       at java.io.DataInputStream.readInt(DataInputStream.java:375)
       at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
       at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
       at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
       at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
       at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
       at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)



That's in a different part of the code.
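
For reference, the EOFException at the top of this trace is thrown by DataInputStream.readInt itself, which does so whenever the stream ends before four bytes are available. On a follower that presumably means the connection to the leader went away mid-read. A minimal sketch of that behaviour, with an in-memory stream standing in for the leader socket (the class and method names are hypothetical, not ZooKeeper code):

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class; the byte array stands in for what the
// follower can still read from the leader connection.
final class ReadIntEofDemo {

    // Returns true if readInt() reaches end-of-stream before it has 4 bytes.
    static boolean hitsEof(byte[] available) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(available))) {
            in.readInt();      // needs exactly 4 bytes
            return false;
        } catch (EOFException e) {
            return true;       // stream ended early, as in the trace above
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Nothing left to read: the peer closed the stream.
        System.out.println(hitsEof(new byte[0]));
        // A full int is available, so no exception is raised.
        System.out.println(hitsEof(new byte[] {0, 0, 0, 1}));
    }
}
```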

-Flavio

On 07 Jul 2014, at 18:50, Aaron Zimmerman <[email protected]> wrote:

> Util.readTxnBytes reads from the buffer and if the length is 0, it returns
> the zero-length array, seemingly indicating the end of the file.
> 
> Then this is detected in FileTxnLog.java:671:
> 
>                byte[] bytes = Util.readTxnBytes(ia);
>                // Since we preallocate, we define EOF to be an
>                if (bytes == null || bytes.length==0) {
>                    throw new EOFException("Failed to read " + logFile);
>                }
> 
> 
> This exception is caught a few lines later, and the streams are closed, etc.
> 
> So this seems to be not really an error condition, but a signal that the
> entire file has been read? Is this exception a red herring?
> 
> 
> 
> 
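
The idea described above, a zero-length record marking the end of valid data in a preallocated file, can be sketched as follows. This is a simplified illustration under stated assumptions: records are framed as an int length followed by the payload, the real log format has additional fields that are omitted here, and the class and method names are hypothetical rather than ZooKeeper's.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo of "zero length means end of log" in a zero-padded file.
final class PaddedLogDemo {

    // Writes each record as [int length][payload], then appends zero padding
    // to mimic the preallocated tail of the file.
    static byte[] writeLog(String[] records, int paddingBytes) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (String r : records) {
                byte[] payload = r.getBytes(StandardCharsets.UTF_8);
                out.writeInt(payload.length);
                out.write(payload);
            }
            out.write(new byte[paddingBytes]);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Counts records until a zero-length record (the padding) is seen
    // or fewer than 4 bytes remain.
    static int countRecords(byte[] log) {
        int count = 0;
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(log))) {
            while (in.available() >= 4) {
                int len = in.readInt();
                if (len == 0) {
                    break;     // reached the zero padding: logical end of log
                }
                in.readFully(new byte[len]);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] log = writeLog(new String[] {"create /a", "setData /a"}, 64);
        System.out.println(countRecords(log));
    }
}
```

In this sketch the reader simply stops at the padding; the snippet quoted above signals the same condition by throwing an EOFException that the caller then catches.
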
> On Mon, Jul 7, 2014 at 11:50 AM, Raúl Gutiérrez Segalés <[email protected]> wrote:
> 
>> On 7 July 2014 09:39, Aaron Zimmerman <[email protected]> wrote:
>> 
>>> What I don't understand is how the entire cluster could die in such a
>>> situation.  I was able to load zookeeper locally using the snapshot and
>>> 10g log file without apparent issue.
>> 
>> 
>> Sure, but it's syncing up with other learners that becomes challenging when
>> having either big snapshots or too many txnlogs, right?
>> 
>> 
>>> I can see how large amounts of data could
>>> cause latency issues in syncing causing a single worker to die, but how
>>> would that explain the node's inability to restart?  When the server
>>> replays the log file, does it have to sync the transactions to other
>>> nodes while it does so?
>>> 
>> 
>> Given that your txn churn is so big, by the time it finishes reading
>> from disk it'll need to catch up with the quorum. How many txns have
>> happened by that point? By the way, we use this patch:
>> 
>> https://issues.apache.org/jira/browse/ZOOKEEPER-1804
>> 
>> to measure transaction rate, do you have any approximation of what your
>> transaction rate might be?
>> 
>> 
>>> 
>>> I can alter the settings as has been discussed, but I worry that I'm just
>>> delaying the same thing from happening again, if I deploy another storm
>>> topology or something.  How can I get the cluster in a state where I can
>>> be confident that it won't crash in a similar way as load increases, or at
>>> least set up some kind of monitoring that will let me know something is
>>> unhealthy?
>>> 
>> 
>> I think it depends on what your txn rate is. Let's measure that first, I
>> guess.
>> 
>> 
>> -rgs
>>
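
One rough way to approximate the transaction rate asked about above, independent of the patch mentioned, is to sample a server's zxid twice and divide the difference by the elapsed time. This relies on the zxid being a 64-bit value whose high 32 bits are the epoch and whose low 32 bits are a counter, so it only holds while the epoch is unchanged. The sketch below assumes the two samples are hex strings such as the Zxid value printed by the `srvr` four-letter command; the class and method names are hypothetical.

```java
// Hypothetical helper: estimates txns/sec from two zxid samples.
final class ZxidRateDemo {

    // High 32 bits of a zxid: the leader epoch.
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    // Low 32 bits of a zxid: the per-epoch transaction counter.
    static long counter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    // Parses a hex zxid with or without a leading "0x".
    static long parseZxid(String hex) {
        String digits = hex.startsWith("0x") ? hex.substring(2) : hex;
        return Long.parseUnsignedLong(digits, 16);
    }

    // Returns transactions per second between the two samples, or -1.0 if
    // the interval is not positive or the epoch changed in between.
    static double txnsPerSecond(long before, long after, long elapsedMillis) {
        if (elapsedMillis <= 0 || epoch(before) != epoch(after)) {
            return -1.0;
        }
        return (counter(after) - counter(before)) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) {
        // Made-up sample values taken 10 seconds apart, for illustration only.
        long first = parseZxid("0x100000064");
        long second = parseZxid("0x1000003e8");
        System.out.println(txnsPerSecond(first, second, 10_000));
    }
}
```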
