I've moved this thread to: https://issues.apache.org/jira/browse/ZOOKEEPER-801
--travis

On Thu, Jul 1, 2010 at 12:37 AM, Patrick Hunt <ph...@apache.org> wrote:
> Hi Travis, as Flavio suggested, it would be great to get the logs. A few
> questions:
>
> 1) How did you eventually recover? Restart the zk servers?
>
> 2) Was the cluster losing quorum during this time? Leader re-election?
>
> 3) Any chance this could have been initially triggered by a long GC pause
> on one of the servers? (Is gc logging turned on? Any sort of heap
> monitoring?) Has the GC been tuned on the servers, for example CMS and
> incremental?
>
> 4) What are the clients using for timeout on the sessions?
>
> 3.4 is probably not due for a few months yet, but we are planning a 3.3.2
> release in a few weeks to fix a couple of critical issues (which don't
> seem related to what you saw). If we can identify the problem here, we
> should be able to include it in any fix release we do.
>
> Fixing something like 517 might help, but it's not clear how we got into
> this state in the first place; fixing 517 might not have any effect if
> the root cause is not addressed. 662 has only ever been reported once
> afaik, and we weren't able to identify the root cause for that one.
>
> One thing we might also consider is modifying the zk client lib to back
> off connection attempts if they keep failing (timing out, say). Today the
> clients are pretty aggressive about reconnection attempts. Some sort of
> backoff (exponential?) would give the server more breathing room in this
> situation.
>
> Patrick
>
> On 06/30/2010 11:13 PM, Travis Crawford wrote:
>>
>> Hey zookeepers -
>>
>> We just experienced a total zookeeper outage; here's a quick post-mortem
>> of the issue and some questions about preventing it going forward. Quick
>> overview of the setup:
>>
>> - RHEL5 2.6.18 kernel
>> - Zookeeper 3.3.0
>> - ulimit raised to 65k files
>> - 3 cluster members
>> - 4-5k connections in steady state
>> - Primarily C and python clients, plus some java
>>
>> In chronological order, the issue manifested itself as an alert about RW
>> tests failing. Logs were full of "too many open files" errors, and
>> netstat showed lots of sockets in CLOSE_WAIT and SYN_RECV. CPU was at
>> 100%, and application logs showed lots of connection timeouts. This
>> suggests an event happened that caused applications to dogpile on
>> Zookeeper, and eventually the CLOSE_WAIT timeouts caused file handles to
>> run out: basically game over.
>>
>> I looked through lots of logs (clients and servers) and did not see a
>> clear indication of what happened. Graphs show a sudden decrease in
>> network traffic when the outage began; zookeeper goes cpu-bound and runs
>> out of file descriptors.
>>
>> Clients are primarily a couple thousand C clients using default
>> connection parameters, and a couple thousand python clients using
>> default connection parameters.
>>
>> Digging through Jira, we see two issues that probably contributed to
>> this outage:
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-662
>> https://issues.apache.org/jira/browse/ZOOKEEPER-517
>>
>> Both are tagged for the 3.4.0 release. Anyone know if that's still the
>> case, and when 3.4.0 is roughly scheduled to ship?
>>
>> Thanks!
>> Travis
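
For readers arriving from the JIRA link, here is a minimal sketch of the
client-side exponential backoff Patrick describes above, written against the
3.3-era Java client. The connectWithBackoff helper and all delay values are
made up for illustration; the actual proposal was to build backoff into the
client library itself, not to wrap it like this at the application level.

    import java.io.IOException;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class BackoffConnect {
        // Illustrative values: requested session timeout, how long to wait
        // for SyncConnected on each attempt, and the backoff ceiling.
        private static final int SESSION_TIMEOUT_MS = 30000;
        private static final long CONNECT_WAIT_MS = 10000;
        private static final long MAX_DELAY_MS = 60000;

        public static ZooKeeper connectWithBackoff(String hosts)
                throws InterruptedException {
            long delayMs = 500; // initial backoff
            while (true) {
                final CountDownLatch connected = new CountDownLatch(1);
                try {
                    ZooKeeper zk = new ZooKeeper(hosts, SESSION_TIMEOUT_MS,
                            new Watcher() {
                                public void process(WatchedEvent event) {
                                    if (event.getState() ==
                                            Event.KeeperState.SyncConnected) {
                                        connected.countDown();
                                    }
                                }
                            });
                    if (connected.await(CONNECT_WAIT_MS,
                            TimeUnit.MILLISECONDS)) {
                        return zk; // connected; hand back the live handle
                    }
                    zk.close(); // no SyncConnected in time; back off, retry
                } catch (IOException e) {
                    // constructor failed outright; back off, retry
                }
                // Double the delay each failed attempt so a few thousand
                // clients don't dogpile a recovering ensemble all at once.
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, MAX_DELAY_MS);
            }
        }
    }

Note that the Java client already retries the hosts of an existing handle
internally (that is the aggressive behavior Patrick mentions); a loop like
this only governs the point where an application must create a fresh
ZooKeeper handle, for example after a session Expired event, which is
unrecoverable on the old handle.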
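
On question 4, one detail worth keeping in mind when reading timeout
reports: the session timeout a client requests is only a proposal. With
default settings the server clamps the negotiated value to between 2x and
20x its tickTime. A small worked example of that arithmetic, with all
values illustrative:

    // How the negotiated session timeout is bounded by the server's
    // tickTime (by default clamped to [2 * tickTime, 20 * tickTime]).
    public class SessionTimeoutBounds {
        public static void main(String[] args) {
            int tickTimeMs = 2000;    // default tickTime
            int requestedMs = 60000;  // what the client asked for
            int minMs = 2 * tickTimeMs;   // 4000 ms
            int maxMs = 20 * tickTimeMs;  // 40000 ms
            int negotiatedMs = Math.max(minMs, Math.min(requestedMs, maxMs));
            // Prints 40000: a 60s request is clamped down to 40s.
            System.out.println("negotiated: " + negotiatedMs + " ms");
        }
    }

So with the default tickTime of 2000 ms, a client asking for 60 seconds
actually gets 40, which matters when judging whether a GC pause (question 3)
was long enough to expire sessions.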