I've moved this thread to:
On Thu, Jul 1, 2010 at 12:37 AM, Patrick Hunt <ph...@apache.org> wrote:
> Hi Travis, as Flavio suggested, it would be great to get the logs. A few
> questions:
> 1) how did you eventually recover, restart the zk servers?
> 2) was the cluster losing quorum during this time? leader re-election?
> 3) Any chance this could have been initially triggered by a long GC pause on
> one of the servers? (is gc logging turned on, any sort of heap monitoring?)
> Has the GC been tuned on the servers, for example CMS and incremental?
> 4) what are the clients using for timeout on the sessions?
> 3.4 probably not for a few months yet, but we are planning for a 3.3.2 in a
> few weeks to fix a couple critical issues (which don't seem related to what
> you saw). If we can identify the problem here we should be able to include
> it in any fix release we do.
> fixing something like 517 might help, but it's not clear how we got to this
> state in the first place. fixing 517 might not have any effect if the root
> cause is not addressed. 662 has only ever been reported once afaik, and we
> weren't able to identify the root cause for that one.
> One thing we might also consider is modifying the zk client lib to back off
> connection attempts if they keep failing (timing out, say). Today the clients
> are pretty aggressive on reconnection attempts. Having some sort of backoff
> (exponential?) would provide more breathing room to the server in this
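
To make the backoff idea concrete, here is a minimal sketch of capped exponential backoff with random jitter. It is not how the zk client libraries behave today, and every name in it (backoff_delays, connect_with_backoff, the connect callable, and the default base/cap values) is hypothetical and chosen only for illustration.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, factor=2.0, rng=random.random):
    """Yield an endless series of jittered, capped exponential delays (seconds)."""
    attempt = 0
    while True:
        ceiling = min(cap, base * factor ** attempt)
        # Jitter: pick a delay between 0 and the current ceiling so that
        # thousands of clients do not all retry at the same instant.
        yield rng() * ceiling
        # Stop growing the exponent once the cap has been reached.
        if base * factor ** attempt < cap:
            attempt += 1


def connect_with_backoff(connect, max_attempts=8, sleep=time.sleep, **kwargs):
    """Call connect() until it succeeds, sleeping between failed attempts.

    `connect` is a hypothetical zero-argument callable that returns a handle
    or raises ConnectionError / TimeoutError on failure.
    """
    delays = backoff_delays(**kwargs)
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except (ConnectionError, TimeoutError):
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            sleep(next(delays))
```

The jitter matters as much as the exponential growth here: with a few thousand clients, a fixed retry schedule would still have them all hit the servers together after each pause.
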
> On 06/30/2010 11:13 PM, Travis Crawford wrote:
>> Hey zookeepers -
>> We just experienced a total zookeeper outage, and here's a quick
>> post-mortem of the issue, and some questions about preventing it going
>> forward. Quick overview of the setup:
>> - RHEL5 2.6.18 kernel
>> - Zookeeper 3.3.0
>> - ulimit raised to 65k files
>> - 3 cluster members
>> - 4-5k connections in steady-state
>> - Primarily C and python clients, plus some java
>> In chronological order, the issue manifested itself as an alert about RW
>> tests failing. Logs were full of "too many files" errors, and the output
>> of netstat showed lots of CLOSE_WAIT and SYN_RECV sockets. CPU was at
>> 100%. Application logs showed lots of connection timeouts. This
>> suggests an event happened that caused applications to dogpile on
>> Zookeeper, and eventually the CLOSE_WAIT timeout caused file handles
>> to run out and basically game over.
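
A socket in CLOSE_WAIT stays open until the local process closes it, so each one keeps holding a file descriptor. For tracking this kind of pile-up over time, a rough sketch that tallies TCP states from the text printed by `netstat -tan` might look like the following. The function name count_tcp_states is made up, and it assumes the usual Linux column layout where the state is the sixth field.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Count TCP connection states in the text printed by `netstat -tan`."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like:
        #   tcp  0  0  local_addr:port  remote_addr:port  STATE
        # Header lines do not start with "tcp" and are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states
```

Feeding it the captured netstat output on a schedule and alerting when the CLOSE_WAIT or SYN_RECV count climbs well above the steady-state level would give some warning before the file descriptor limit is hit.
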
>> I looked through lots of logs (clients+servers) and did not see a
>> clear indication of what happened. Graphs show a sudden decrease in
>> network traffic when the outage began, after which zookeeper went CPU
>> bound and ran out of file descriptors.
>> Clients are primarily a couple thousand C clients using default
>> connection parameters, and a couple thousand python clients using
>> default connection parameters.
>> Digging through Jira we see two issues that probably contributed to this
>> Both are tagged for the 3.4.0 release. Anyone know if that's still the
>> case, and when 3.4.0 is roughly scheduled to ship?