well, not transaction rate but transaction count.. you can get the rate out of that :-D
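If you can't pull in the ZOOKEEPER-1804 patch below right away, a quick-and-dirty alternative is to sample the zxid: its low 32 bits are a per-epoch transaction counter, so two samples over the 'srvr' four-letter command give an approximate txn/sec figure. Rough, untested sketch (host, port and interval are placeholders, and the counter resets on a leader election):

#!/usr/bin/env python
# Rough txn/sec estimate from the zxid: the low 32 bits are a per-epoch
# transaction counter, so two samples over the 'srvr' four-letter command
# give an approximate rate. The counter resets on a leader election, so a
# negative delta just means "epoch changed, sample again".
import socket
import time

def zxid(host="localhost", port=2181):
    # Send 'srvr' and parse the "Zxid: 0x..." line out of the reply.
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(b"srvr")
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    for line in data.decode().splitlines():
        if line.startswith("Zxid:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise RuntimeError("no Zxid line in srvr output")

def txn_rate(interval=60, host="localhost", port=2181):
    # Approximate transactions/sec over 'interval' seconds.
    first = zxid(host, port) & 0xFFFFFFFF
    time.sleep(interval)
    second = zxid(host, port) & 0xFFFFFFFF
    delta = second - first
    return delta / float(interval) if delta >= 0 else None

if __name__ == "__main__":
    print("approx txn/sec:", txn_rate())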
-rgs

On 7 July 2014 09:50, Raúl Gutiérrez Segalés <[email protected]> wrote:
> On 7 July 2014 09:39, Aaron Zimmerman <[email protected]> wrote:
>
>> What I don't understand is how the entire cluster could die in such a
>> situation. I was able to load zookeeper locally using the snapshot and
>> 10g log file without apparent issue.
>
> Sure, but it's syncing up with other learners that becomes challenging
> when having either big snapshots or too many txnlogs, right?
>
>> I can see how large amounts of data could cause latency issues in
>> syncing causing a single worker to die, but how would that explain the
>> node's inability to restart? When the server replays the log file, does
>> it have to sync the transactions to other nodes while it does so?
>
> Given that your txn churn is so big, by the time it finishes reading
> from disk it'll need to catch up with the quorum.. how many txns have
> happened by that point? By the way, we use this patch:
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-1804
>
> to measure transaction rate, do you have any approximation of what your
> transaction rate might be?
>
>> I can alter the settings as has been discussed, but I worry that I'm
>> just delaying the same thing from happening again, if I deploy another
>> storm topology or something. How can I get the cluster in a state where
>> I can be confident that it won't crash in a similar way as load
>> increases, or at least set up some kind of monitoring that will let me
>> know something is unhealthy?
>
> I think it depends on what your txn rate is, let's measure that first I
> guess.
>
> -rgs
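On the monitoring question quoted above: a minimal poll of the 'mntr' four-letter word (available in 3.4+) will at least surface latency creeping up, requests backing up, or followers falling out of sync before the whole ensemble falls over. Sketch below, with placeholder hostnames and thresholds:

#!/usr/bin/env python
# Minimal health poll over the 'mntr' four-letter command (ZooKeeper 3.4+).
# ENSEMBLE and the thresholds are placeholders -- tune them for your setup.
import socket

ENSEMBLE = ["zk1:2181", "zk2:2181", "zk3:2181"]   # placeholder hostnames
MAX_AVG_LATENCY_MS = 50                           # placeholder threshold
MAX_OUTSTANDING = 100                             # placeholder threshold

def mntr(hostport):
    # Return the tab-separated 'mntr' output as a dict of strings.
    host, port = hostport.split(":")
    with socket.create_connection((host, int(port)), timeout=5) as s:
        s.sendall(b"mntr")
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    return dict(line.split("\t", 1)
                for line in data.decode().splitlines() if "\t" in line)

def check(hostport):
    # Print anything that looks unhealthy for one server.
    try:
        stats = mntr(hostport)
    except OSError as exc:
        print("%s: UNREACHABLE (%s)" % (hostport, exc))
        return
    if int(stats.get("zk_avg_latency", 0)) > MAX_AVG_LATENCY_MS:
        print("%s: high avg latency %s ms" % (hostport, stats["zk_avg_latency"]))
    if int(stats.get("zk_outstanding_requests", 0)) > MAX_OUTSTANDING:
        print("%s: %s outstanding requests" % (hostport, stats["zk_outstanding_requests"]))
    if stats.get("zk_server_state") == "leader":
        followers = int(stats.get("zk_followers", 0))
        synced = int(stats.get("zk_synced_followers", 0))
        if synced < followers:
            print("%s: only %d/%d followers synced" % (hostport, synced, followers))

if __name__ == "__main__":
    for hp in ENSEMBLE:
        check(hp)

Wire that into cron or whatever alerting you already have; the thresholds are guesses and worth tuning against a healthy baseline.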
