Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Ted Dunning
Yes. That is roughly what I mean. If one server starts a GC, it can effectively go offline. That might pressure the other servers enough that one of them starts a GC. This is unlikely with your GC settings, but you should turn on the verbose GC logging to be sure. On Wed, May 12, 2010 at 10:09
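A minimal sketch of how the verbose GC logs suggested above could be checked for long pauses. It assumes each server's JVM was started with the HotSpot flags -verbose:gc and -XX:+PrintGCTimeStamps, and that the log lines have the simple one-line form "uptime: [GC before->after(total), N secs]". With -XX:+PrintGCDetails or other collectors the lines look different and the pattern would need adjusting. The function name and the one-second threshold are illustrative only.

```python
import re

# Assumed line shape (plain -verbose:gc with -XX:+PrintGCTimeStamps):
#   12.500: [Full GC 60000K->30000K(65536K), 2.5000000 secs]
GC_LINE = re.compile(r"^(\d+\.\d+): \[(Full GC|GC)\b.*?, (\d+\.\d+) secs\]")


def long_pauses(lines, threshold_secs=1.0):
    """Return (jvm_uptime_secs, kind, pause_secs) for each pause >= threshold."""
    pauses = []
    for line in lines:
        match = GC_LINE.match(line.strip())
        if match is None:
            continue  # not a GC line in the assumed format
        pause = float(match.group(3))
        if pause >= threshold_secs:
            pauses.append((float(match.group(1)), match.group(2), pause))
    return pauses
```

Running this over each server's log gives a list of long pauses per machine. The timestamps are seconds since that JVM started, so they have to be shifted by each server's start time before checking whether pauses on different machines overlap.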

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Patrick Hunt
On 05/12/2010 08:30 PM, Aaron Crow wrote: I may have a better idea of what caused the trouble. I way, WAY underestimated the number of nodes we collect over time. Right now we're at 1.9 million. This isn't a bug of our application; it's actually a feature (but perhaps an ill-conceived one). A m

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Aaron Crow
Hi Ted, yeah it's a big number, eh? We're essentially using Zookeeper to track the state of cache entries, and currently we don't bound our cache. I didn't realize how many entries we grow to over a long period of time, until I started counting nodes in Zookeeper. But, sorry, I'm not sure what you

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Ted Dunning
Impressive number here, especially at your quoted "few per second" rate. Are you sure that you haven't inadvertently synchronized GC on multiple machines? On Wed, May 12, 2010 at 8:30 PM, Aaron Crow wrote: > Right now we're at > 1.9 million. This isn't a bug of our application; it's actually a

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Aaron Crow
I may have a better idea of what caused the trouble. I way, WAY underestimated the number of nodes we collect over time. Right now we're at 1.9 million. This isn't a bug of our application; it's actually a feature (but perhaps an ill-conceived one). A most recent snapshot from a Zookeeper db is 22

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-04-30 Thread Patrick Hunt
On 04/30/2010 10:16 AM, Aaron Crow wrote: Hi Patrick, thanks for your time and detailed questions. No worries. When we hear about an issue we're very interested to follow up and resolve it, regardless of the source. We take the project goals of high reliability/availability _very_ seriously,

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-04-30 Thread Aaron Crow
Hi Patrick, thanks for your time and detailed questions. We're running on Java build 1.6.0_14-b08, on Ubuntu 4.2.4-1ubuntu3. Below is output from a recent stat, and a question about node count. For your other questions, I should save your time with a batch reply: I wasn't tracking nearly enough th

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-04-28 Thread Patrick Hunt
Btw, are you monitoring the ZK server jvms? Please take a look at http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands It would be interesting if you could run commands such as "stat" against your currently running cluster. In particular I'd be interested to know
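For reference, a rough sketch of running the "stat" command mentioned above from a script. It assumes the server listens on the default client port 2181 and replies with "Key: value" lines (for example "Mode" and "Node count") before closing the connection. The function names are hypothetical, and the exact fields in the reply may vary by ZooKeeper version.

```python
import socket


def four_letter_word(host, port=2181, command="stat", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Collect the top-level 'Key: value' lines of a stat reply into a dict."""
    fields = {}
    for line in text.splitlines():
        # Assumption: per-client lines are indented or start with '/', skip them.
        if line.startswith((" ", "\t", "/")):
            continue
        key, sep, value = line.partition(":")
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields
```

Against a running server, something like parse_stat(four_letter_word("localhost")).get("Node count") would report the node count for that server, where "localhost" stands in for one of the ensemble hosts.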

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-04-28 Thread Patrick Hunt
Hi Aaron, some questions/comments below: On 04/28/2010 06:29 PM, Aaron Crow wrote: We were running version 3.2.2 for about a month and it was working well for us. Then late this past Saturday night, our cluster went pathological. One of the 3 ZK servers spewed many WARNs (see below), and the oth