Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Ted Dunning
Yes.  That is roughly what I mean.

If one server starts a GC, it can effectively go offline.  That might
pressure the other servers enough that one of them starts a GC.

This is unlikely with your GC settings, but you should turn on the verbose
GC logging to be sure.

On Wed, May 12, 2010 at 10:09 PM, Aaron Crow wrote:

> But, sorry, I'm not sure what you
> mean about synchronized GC? You mean maybe all our Zookeepers get hit by a
> big GC at the same time, due to the very large db?
>


Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Patrick Hunt


On 05/12/2010 08:30 PM, Aaron Crow wrote:

I may have a better idea of what caused the trouble. I way, WAY
underestimated the number of nodes we collect over time. Right now we're at
1.9 million. This isn't a bug in our application; it's actually a feature
(but perhaps an ill-conceived one).

A most recent snapshot from a Zookeeper db is 227MB. If I scp it over to one
of the other Zookeeper hosts, it takes about 4 seconds.



Nice. You probably hold the record for largest (znode count) production 
ZK repo. Largest I've heard of at least.



Now, there are some things I can do to limit the number of nodes we collect.
My question is, how deadly could this node size be for us? Patrick mentioned
to me that he's run Zookeeper with this many nodes, but you need to be
careful about tuning. We're currently running with the recommended JVM
settings (see below). We're using different drives for the 2 different kinds
of data dirs that Zookeeper needs. We may also have the option of running on
a 64 bit OS with added RAM, if it's worth it. What about timeout settings?
I'm copying in our current settings below, are those ok?



As long as you have enough memory/disk/IO you should be OK. Are you monitoring 
the operation latency on the servers? (via the four-letter words, such as "stat"?)


You might increase the init/sync limits a bit to ensure that the 
followers have enough time to download the snapshot, deserialize it, and get 
set up with the leader (if this takes too long the quorum will fail and 
reelect a new leader, which might repeat indefinitely).
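
As a back-of-the-envelope check (the doubled values below are purely illustrative, not a recommendation): with the settings quoted later in the thread, the follower-init budget is tickTime × initLimit = 2000 ms × 10 = 20 s, which has to cover the snapshot transfer (~4 s for the 227 MB snapshot mentioned above) plus deserialization and leader handshake.

```
# zoo.cfg sketch -- illustrative values only
tickTime=2000    # ms; the base time unit
initLimit=20     # 20 ticks = 40 s for a follower to fetch and load the snapshot
syncLimit=10     # 10 ticks = 20 s of allowed lag when syncing with the leader
```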



Or should we just figure out how to keep our node count much lower? And how
low is "definitely pretty safe"?



There's really no "max" - it's just dependent on your resources. Memory 
in particular.


You should turn on incremental GC mode though (-XX:+CMSIncrementalMode), 
otherwise large GC pauses will wreck your latencies. Check out the link 
below. Verbose GC logging is also useful for tracking down issues later (if 
something bad happens you can use it to rule GC in or out as a cause).


http://java.sun.com/docs/hotspot/gc5.0/gc_tuning_5.html#0.0.0.0.Incremental%20mode%7Coutline
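
For reference, a sketch of how those flags might be combined (the flag names are standard HotSpot options for the Java 6 VMs discussed in this thread; the heap size and GC log path are assumptions, not recommendations):

```
# Sketch of server JVM flags -- heap size and gc log path are assumptions
JVMFLAGS="-Xms2560m -Xmx2560m \
  -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
  -Xloggc:/path/to/zookeeper-gc.log"
```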

Regards,

Patrick



=== some current settings ===
-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -Xms2560m -Xmx2560m
tickTime=2000
initLimit=10
syncLimit=5



Many thanks in advance for any good advice.
Aaron


On Wed, Apr 28, 2010 at 10:47 PM, Patrick Hunt  wrote:


Hi Aaron, some questions/comments below:


On 04/28/2010 06:29 PM, Aaron Crow wrote:


We were running version 3.2.2 for about a month and it was working well for
us. Then late this past Saturday night, our cluster went pathological. One
of the 3 ZK servers spewed many WARNs (see below), and the other 2 servers
were almost constantly pegging the CPU. All three servers are on separate
machines. From what we could tell, the machines were fine... networking
fine, disk fine, etc. The ZK clients were completely unable to complete
their connections to ZK.



These machines are local (not wan) connected then? What OS and java version
are you using?

Do you see any FATAL or ERROR level messages in the logs?

It would help to look at your zk config files for these servers. Could you
provide (you might want to create a JIRA first, then just attach configs and
other details/collateral to that, easier than dealing with email)

If you have logs for the time period and can share that would be most
useful. (again, gzip and attach to the jira)


We tried all sorts of restarts, running zkCleanup, etc. We even completely
shut down our clients... and the pathology continued. Our workaround was to
do an urgent upgrade to version 3.3.0. The new ZK cluster with 3.3.0 has
been running well for us... so far...



Off hand and with the data we have so far nothing sticks out that 3.3 would
have resolved (JIRA is conveniently down for the last hour or so so I can't
review right now). Although there were some changes to reduce memory
consumption (see below).


I realize that, sadly, this message doesn't contain nearly enough details
to trace exactly what happened. I guess I'm wondering if anyone has seen
this general scenario, and/or knows how to prevent it? Is there anything we
might be doing client side to trigger this? Our application-level request
frequency is maybe a few requests to Zookeeper per second, times 5 client
applications. If we detect a SESSION EXPIRED, we do a simple "create new
client and use that instead". And we were seeing this happen occasionally.



What are the clients doing? Do you have a large number/size of znodes?

Do you see any OutOfMemoryError in the logs?

Could the ZK server java process be swapping? Are you monitoring GC,
perhaps large GC pauses are happening?

I have a suspicion that one of a few things might be happening. I see the
following in your original email:



:followerhand...@302] - Sending snapshot last zxid of peer is

0xd0007d66d

zxid of leader is 0xf
2010-04-24 23:06:03,254 - ERROR [FollowerHandler-/10.

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Aaron Crow
Hi Ted, yeah it's a big number, eh? We're essentially using Zookeeper to
track the state of cache entries, and currently we don't bound our cache. I
didn't realize how many entries we grow to over a long period of time, until
I started counting nodes in Zookeeper. But, sorry, I'm not sure what you
mean about synchronized GC? You mean maybe all our Zookeepers get hit by a
big GC at the same time, due to the very large db?


On Wed, May 12, 2010 at 9:21 PM, Ted Dunning  wrote:

> Impressive number here, especially at your quoted "few per second" rate.
>
> Are you sure that you haven't inadvertently synchronized GC on multiple
> machines?
>
> On Wed, May 12, 2010 at 8:30 PM, Aaron Crow 
> wrote:
>
> > Right now we're at
> > 1.9 million. This isn't a bug of our application; it's actually a feature
> > (but perhaps an ill-conceived one).
> >
>


Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Ted Dunning
Impressive number here, especially at your quoted "few per second" rate.

Are you sure that you haven't inadvertently synchronized GC on multiple
machines?

On Wed, May 12, 2010 at 8:30 PM, Aaron Crow  wrote:

> Right now we're at
> 1.9 million. This isn't a bug of our application; it's actually a feature
> (but perhaps an ill-conceived one).
>


Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-05-12 Thread Aaron Crow
I may have a better idea of what caused the trouble. I way, WAY
underestimated the number of nodes we collect over time. Right now we're at
1.9 million. This isn't a bug in our application; it's actually a feature
(but perhaps an ill-conceived one).

A most recent snapshot from a Zookeeper db is 227MB. If I scp it over to one
of the other Zookeeper hosts, it takes about 4 seconds.

Now, there are some things I can do to limit the number of nodes we collect.
My question is, how deadly could this node size be for us? Patrick mentioned
to me that he's run Zookeeper with this many nodes, but you need to be
careful about tuning. We're currently running with the recommended JVM
settings (see below). We're using different drives for the 2 different kinds
of data dirs that Zookeeper needs. We may also have the option of running on
a 64 bit OS with added RAM, if it's worth it. What about timeout settings?
I'm copying in our current settings below, are those ok?

Or should we just figure out how to keep our node count much lower? And how
low is "definitely pretty safe"?


=== some current settings ===
-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC -Xms2560m -Xmx2560m
tickTime=2000
initLimit=10
syncLimit=5



Many thanks in advance for any good advice.
Aaron


On Wed, Apr 28, 2010 at 10:47 PM, Patrick Hunt  wrote:

> Hi Aaron, some questions/comments below:
>
>
> On 04/28/2010 06:29 PM, Aaron Crow wrote:
>
>> We were running version 3.2.2 for about a month and it was working well
>> for
>> us. Then late this past Saturday night, our cluster went pathological. One
>> of the 3 ZK servers spewed many WARNs (see below), and the other 2 servers
>> were almost constantly pegging the CPU. All three servers are on separate
>> machines. From what we could tell, the machines were fine... networking
>> fine, disk fine, etc. The ZK clients were completely unable to complete
>> their connections to ZK.
>>
>>
> These machines are local (not wan) connected then? What OS and java version
> are you using?
>
> Do you see any FATAL or ERROR level messages in the logs?
>
> It would help to look at your zk config files for these servers. Could you
> provide (you might want to create a JIRA first, then just attach configs and
> other details/collateral to that, easier than dealing with email)
>
> If you have logs for the time period and can share that would be most
> useful. (again, gzip and attach to the jira)
>
>
>  We tried all sorts of restarts, running zkCleanup, etc. We even completely
>> shut down our clients... and the pathology continued. Our workaround was
>> to
>> do an urgent upgrade to version 3.3.0. The new ZK cluster with 3.3.0 has
>> been running well for us... so far...
>>
>>
> Off hand and with the data we have so far nothing sticks out that 3.3 would
> have resolved (JIRA is conveniently down for the last hour or so so I can't
> review right now). Although there were some changes to reduce memory
> consumption (see below).
>
>
>  I realize that, sadly, this message doesn't contain nearly enough details
>> to
>> trace exactly what happened. I guess I'm wondering if anyone has seen this
>> general scenario, and/or knows how to prevent? Is there anything we might
>> be
>> doing client side to trigger this? Our application level request frequency
>> is maybe a few requests to Zookeeper per second, times 5 clients
>> applications. If we detect a SESSION EXPIRED, we do a simple "create new
>> client and use that instead". And we were seeing this happen occasionally.
>>
>>
> What are the clients doing? Do you have a large number/size of znodes?
>
> Do you see any OutOfMemoryError in the logs?
>
> Could the ZK server java process be swapping? Are you monitoring GC,
> perhaps large GC pauses are happening?
>
> I have a suspicion that one of a few things might be happening. I see the
> following in your original email:
>
>
> > :followerhand...@302] - Sending snapshot last zxid of peer is
> 0xd0007d66d
> > zxid of leader is 0xf
> > 2010-04-24 23:06:03,254 - ERROR [FollowerHandler-/10.0.10.116:34405
> > :followerhand...@415] - Unexpected exception causing shutdown while sock
> > still open
> > java.net.SocketException: Broken pipe
> >  at java.net.SocketOutputStream.socketWrite0(Native Method)
> >  at
>
>  this log section is saying that a ZK server (follower) connected to
> the elected leader and started to download (leader is sending here) the
> latest ZK snapshot (db) to the follower while it is recovering. However
> before the follower fully downloads the snapshot the connection (pipe) is
> broken and this operation fails. At this point the leader itself will
> probably give up the lead, a new election will happen, and it might be the
> case that this same problem occurs - the follower tries to download the
> snapshot but it again fails (and this loops forever)
>
> This issue can happen if the servers are slowed, either by slow
> interconnects (wan), network connectivity/perf prob

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-04-30 Thread Patrick Hunt


On 04/30/2010 10:16 AM, Aaron Crow wrote:

Hi Patrick, thanks for your time and detailed questions.



No worries. When we hear about an issue we're very interested to 
follow up and resolve it, regardless of the source. We take the project 
goals of high reliability/availability _very_ seriously; we know that our 
users are deploying to mission critical environments.



We're running on Java build 1.6.0_14-b08, on Ubuntu 4.2.4-1ubuntu3. Below is
output from a recent stat, and a question about node count. For your other
questions, I'll save you some time with a batch reply: I wasn't tracking
nearly enough things (like logs), so it might not be fruitful to try to
investigate this failure in detail.



NP. If you had those available I would be interested to take a look but 
I think this node count size (below) is probably the issue.



I've since started rolling out better settings, including the memory and GC
settings recommended in the wiki, and logging to ROLLINGFILE. Also logging
and archiving verbose gc. We've been fine tuning our client app with similar
settings, but my bad that we didn't roll out ZK itself with these settings
as well. (Oh and I also set my browser's homepage to
http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing ;-)


Great. That page is really good for challenging assumptions.


So if we have any more serious issues, I think I should be able to provide
better details at that time.



Great!


I do have a question about node count, though. Below is output from stat on
one of our servers. Node count says *1182653*. Really!? Well, from our
application standpoint, we only create and use hundreds of nodes at the
most. Certainly not any amount in the neighborhood of 1 million. Does this
demonstrate a serious problem?



It certainly looks like this might be the issue. One million znodes is 
a lot, but I've tested with more (upwards of 5 million znodes with 
25 million concurrent watches is the most I remember testing) 
successfully. It's really just a matter of tuning the GC and starting 
the JVM with sufficient heap. It could be that the counter is wrong, but 
I doubt it (I've never seen that before). Is the count approximately the same 
on all servers in the ensemble?
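
One way to compare the counts across the ensemble, sketched in Python (the host names are placeholders; the parsing assumes the "stat" output format shown later in this thread):

```python
import socket

def four_letter(host, port, cmd=b"stat", timeout=5.0):
    """Send a ZooKeeper four-letter-word command and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")

def node_count(stat_text):
    """Extract the znode count from 'stat' output, or None if absent."""
    for line in stat_text.splitlines():
        if line.startswith("Node count:"):
            return int(line.split(":", 1)[1].strip())
    return None

# Placeholder hosts -- substitute your own ensemble members:
# for host in ("10.0.10.116", "10.0.10.117", "10.0.10.118"):
#     print(host, node_count(four_letter(host, 2181)))
```

A count that diverges significantly between servers would itself be worth investigating.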


Given that you are only expecting hundreds of znodes, I suspect that 
you have a "leak" of znodes somewhere.


Suggestion - Fire up the java client shell (bin/zkCli.sh or just run 
ZooKeeperMain directly), connect to one of your servers and use the "ls" 
command to look around. You might be surprised at what you find. (or you 
might find that the count is wrong, etc...) It's a good sanity check.


Patrick



Thanks,
Aaron

rails_dep...@task01:~/zookeeper-3.3.0/bin$ echo stat | nc 127.0.0.1 2181

Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT

Clients:

  /10.0.10.116:34648[1](queued=0,recved=86200,sent=86200)

  /10.0.10.116:45609[1](queued=0,recved=27695,sent=27695)

  /10.0.10.120:48432[1](queued=0,recved=26030,sent=26030)

  /10.0.10.117:53336[1](queued=0,recved=43100,sent=43100)

  /10.0.10.120:35087[1](queued=0,recved=4268,sent=4268)

  /10.0.10.116:49526[1](queued=0,recved=4273,sent=4273)

  /10.0.10.118:45614[1](queued=0,recved=43624,sent=43624)

  /10.0.10.116:45600[1](queued=0,recved=27704,sent=27704)

  /10.0.10.120:48440[1](queued=0,recved=7161,sent=7161)

  /10.0.10.120:48437[1](queued=0,recved=7180,sent=7180)

  /10.0.10.118:59310[1](queued=0,recved=63205,sent=63205)

  /10.0.10.116:51072[1](queued=0,recved=14516,sent=14516)

  /10.0.10.116:51071[1](queued=0,recved=14516,sent=14516)

  /10.0.10.119:34097[1](queued=0,recved=42984,sent=42984)

  /10.0.10.119:41868[1](queued=0,recved=18558,sent=18558)

  /10.0.10.120:48441[1](queued=0,recved=21712,sent=21712)

  /10.0.10.116:49528[1](queued=0,recved=12967,sent=12967)

  /127.0.0.1:37234[1](queued=0,recved=1,sent=0)



Latency min/avg/max: 0/0/105


Received: 497790

Sent: 497788

Outstanding: 0

Zxid: 0xa0003aa4b

Mode: follower

*Node count: 1182653*




On Wed, Apr 28, 2010 at 10:56 PM, Patrick Hunt  wrote:


Btw, are you monitoring the ZK server jvms? Please take a look at


http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands

It would be interesting if you could run commands such as "stat" against
your currently running cluster. In particular I'd be interested to know what
you see for latency and "node count" in the stat response. Run this command
against each of your servers in the ensemble. For example high "max latency"
might indicate that something (usually swap/gc) is causing the server to
respond slowly in some cases.

Patrick


On 04/28/2010 10:47 PM, Patrick Hunt wrote:


Hi Aaron, some questions/comments below:

On 04/28/2010 06:29 PM, Aaron Crow wrote:


We were running version 3.2.2 for about a month and it was working
well for
us. Then late this past Saturday night, our cluster went pathological.
One
of the 3 ZK servers spewed many WARNs (see below), and the other 2
servers
were almost con

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-04-30 Thread Aaron Crow
Hi Patrick, thanks for your time and detailed questions.

We're running on Java build 1.6.0_14-b08, on Ubuntu 4.2.4-1ubuntu3. Below is
output from a recent stat, and a question about node count. For your other
questions, I'll save you some time with a batch reply: I wasn't tracking
nearly enough things (like logs), so it might not be fruitful to try to
investigate this failure in detail.

I've since started rolling out better settings, including the memory and GC
settings recommended in the wiki, and logging to ROLLINGFILE. Also logging
and archiving verbose gc. We've been fine tuning our client app with similar
settings, but my bad that we didn't roll out ZK itself with these settings
as well. (Oh and I also set my browser's homepage to
http://en.wikipedia.org/wiki/Fallacies_of_Distributed_Computing ;-)

So if we have any more serious issues, I think I should be able to provide
better details at that time.

I do have a question about node count, though. Below is output from stat on
one of our servers. Node count says *1182653*. Really!? Well, from our
application standpoint, we only create and use hundreds of nodes at the
most. Certainly not any amount in the neighborhood of 1 million. Does this
demonstrate a serious problem?


Thanks,
Aaron

rails_dep...@task01:~/zookeeper-3.3.0/bin$ echo stat | nc 127.0.0.1 2181

Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT

Clients:

 /10.0.10.116:34648[1](queued=0,recved=86200,sent=86200)

 /10.0.10.116:45609[1](queued=0,recved=27695,sent=27695)

 /10.0.10.120:48432[1](queued=0,recved=26030,sent=26030)

 /10.0.10.117:53336[1](queued=0,recved=43100,sent=43100)

 /10.0.10.120:35087[1](queued=0,recved=4268,sent=4268)

 /10.0.10.116:49526[1](queued=0,recved=4273,sent=4273)

 /10.0.10.118:45614[1](queued=0,recved=43624,sent=43624)

 /10.0.10.116:45600[1](queued=0,recved=27704,sent=27704)

 /10.0.10.120:48440[1](queued=0,recved=7161,sent=7161)

 /10.0.10.120:48437[1](queued=0,recved=7180,sent=7180)

 /10.0.10.118:59310[1](queued=0,recved=63205,sent=63205)

 /10.0.10.116:51072[1](queued=0,recved=14516,sent=14516)

 /10.0.10.116:51071[1](queued=0,recved=14516,sent=14516)

 /10.0.10.119:34097[1](queued=0,recved=42984,sent=42984)

 /10.0.10.119:41868[1](queued=0,recved=18558,sent=18558)

 /10.0.10.120:48441[1](queued=0,recved=21712,sent=21712)

 /10.0.10.116:49528[1](queued=0,recved=12967,sent=12967)

 /127.0.0.1:37234[1](queued=0,recved=1,sent=0)


Latency min/avg/max: 0/0/105

Received: 497790

Sent: 497788

Outstanding: 0

Zxid: 0xa0003aa4b

Mode: follower

*Node count: 1182653*




On Wed, Apr 28, 2010 at 10:56 PM, Patrick Hunt  wrote:

> Btw, are you monitoring the ZK server jvms? Please take a look at
>
>
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
>
> It would be interesting if you could run commands such as "stat" against
> your currently running cluster. In particular I'd be interested to know what
> you see for latency and "node count" in the stat response. Run this command
> against each of your servers in the ensemble. For example high "max latency"
> might indicate that something (usually swap/gc) is causing the server to
> respond slowly in some cases.
>
> Patrick
>
>
> On 04/28/2010 10:47 PM, Patrick Hunt wrote:
>
>> Hi Aaron, some questions/comments below:
>>
>> On 04/28/2010 06:29 PM, Aaron Crow wrote:
>>
>>> We were running version 3.2.2 for about a month and it was working
>>> well for
>>> us. Then late this past Saturday night, our cluster went pathological.
>>> One
>>> of the 3 ZK servers spewed many WARNs (see below), and the other 2
>>> servers
>>> were almost constantly pegging the CPU. All three servers are on separate
>>> machines. From what we could tell, the machines were fine... networking
>>> fine, disk fine, etc. The ZK clients were completely unable to complete
>>> their connections to ZK.
>>>
>>>
>> These machines are local (not wan) connected then? What OS and java
>> version are you using?
>>
>> Do you see any FATAL or ERROR level messages in the logs?
>>
>> It would help to look at your zk config files for these servers. Could
>> you provide (you might want to create a JIRA first, then just attach
>> configs and other details/collateral to that, easier than dealing with
>> email)
>>
>> If you have logs for the time period and can share that would be most
>> useful. (again, gzip and attach to the jira)
>>
>>  We tried all sorts of restarts, running zkCleanup, etc. We even
>>> completely
>>> shut down our clients... and the pathology continued. Our workaround
>>> was to
>>> do an urgent upgrade to version 3.3.0. The new ZK cluster with 3.3.0 has
>>> been running well for us... so far...
>>>
>>>
>> Off hand and with the data we have so far nothing sticks out that 3.3
>> would have resolved (JIRA is conveniently down for the last hour or so
>> so I can't review right now). Although there were some changes to reduce
>> memory consumption (see below).
>>
>>  I realize that, sadly, this m

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-04-28 Thread Patrick Hunt

Btw, are you monitoring the ZK server jvms? Please take a look at

http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands

It would be interesting if you could run commands such as "stat" 
against your currently running cluster. In particular I'd be interested 
to know what you see for latency and "node count" in the stat response. 
Run this command against each of your servers in the ensemble. For 
example high "max latency" might indicate that something (usually 
swap/gc) is causing the server to respond slowly in some cases.


Patrick

On 04/28/2010 10:47 PM, Patrick Hunt wrote:

Hi Aaron, some questions/comments below:

On 04/28/2010 06:29 PM, Aaron Crow wrote:

We were running version 3.2.2 for about a month and it was working well
for us. Then late this past Saturday night, our cluster went pathological.
One of the 3 ZK servers spewed many WARNs (see below), and the other 2
servers were almost constantly pegging the CPU. All three servers are on
separate machines. From what we could tell, the machines were fine...
networking fine, disk fine, etc. The ZK clients were completely unable to
complete their connections to ZK.



These machines are local (not wan) connected then? What OS and java
version are you using?

Do you see any FATAL or ERROR level messages in the logs?

It would help to look at your zk config files for these servers. Could
you provide (you might want to create a JIRA first, then just attach
configs and other details/collateral to that, easier than dealing with
email)

If you have logs for the time period and can share that would be most
useful. (again, gzip and attach to the jira)


We tried all sorts of restarts, running zkCleanup, etc. We even completely
shut down our clients... and the pathology continued. Our workaround was to
do an urgent upgrade to version 3.3.0. The new ZK cluster with 3.3.0 has
been running well for us... so far...



Off hand and with the data we have so far nothing sticks out that 3.3
would have resolved (JIRA is conveniently down for the last hour or so
so I can't review right now). Although there were some changes to reduce
memory consumption (see below).


I realize that, sadly, this message doesn't contain nearly enough details
to trace exactly what happened. I guess I'm wondering if anyone has seen
this general scenario, and/or knows how to prevent it? Is there anything we
might be doing client side to trigger this? Our application-level request
frequency is maybe a few requests to Zookeeper per second, times 5 client
applications. If we detect a SESSION EXPIRED, we do a simple "create new
client and use that instead". And we were seeing this happen occasionally.



What are the clients doing? Do you have a large number/size of znodes?

Do you see any OutOfMemoryError in the logs?

Could the ZK server java process be swapping? Are you monitoring GC,
perhaps large GC pauses are happening?

I have a suspicion that one of a few things might be happening. I see
the following in your original email:

 > :followerhand...@302] - Sending snapshot last zxid of peer is
0xd0007d66d
 > zxid of leader is 0xf
 > 2010-04-24 23:06:03,254 - ERROR [FollowerHandler-/10.0.10.116:34405
 > :followerhand...@415] - Unexpected exception causing shutdown while sock
 > still open
 > java.net.SocketException: Broken pipe
 > at java.net.SocketOutputStream.socketWrite0(Native Method)
 > at

 this log section is saying that a ZK server (follower) connected to
the elected leader and started to download (leader is sending here) the
latest ZK snapshot (db) to the follower while it is recovering. However
before the follower fully downloads the snapshot the connection (pipe)
is broken and this operation fails. At this point the leader itself will
probably give up the lead, a new election will happen, and it might be
the case that this same problem occurs - the follower tries to download
the snapshot but it again fails (and this loops forever)

This issue can happen if the servers are slowed, either by slow
interconnects (wan), network connectivity/perf problems, swapping of the
JVM, and most commonly GC in the VM.

See this recent case:
http://www.mail-archive.com/zookeeper-...@hadoop.apache.org/msg08083.html

What is the size of your snapshot file? (you can see this in your
zookeeper data directory, unless you wiped this during your attempt to
cleanup/recover/upgrade?).

Seeing your config file, and getting a sense of the snapshot file size
will give us more insight.

Patrick


Many, many thanks in advance for any insights or advice.


Example log output from the ZK server that spewed many WARNs:

===

2010-04-24 22:31:24,551 - WARN
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@318] - Exception when
following
the leader
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at

Re: Pathological ZK cluster: 1 server verbosely WARN'ing, other 2 servers pegging CPU

2010-04-28 Thread Patrick Hunt

Hi Aaron, some questions/comments below:

On 04/28/2010 06:29 PM, Aaron Crow wrote:

We were running version 3.2.2 for about a month and it was working well for
us. Then late this past Saturday night, our cluster went pathological. One
of the 3 ZK servers spewed many WARNs (see below), and the other 2 servers
were almost constantly pegging the CPU. All three servers are on separate
machines. From what we could tell, the machines were fine... networking
fine, disk fine, etc. The ZK clients were completely unable to complete
their connections to ZK.



These machines are local (not wan) connected then? What OS and java 
version are you using?


Do you see any FATAL or ERROR level messages in the logs?

It would help to look at your zk config files for these servers. Could 
you provide (you might want to create a JIRA first, then just attach 
configs and other details/collateral to that, easier than dealing with 
email)


If you have logs for the time period and can share that would be most 
useful. (again, gzip and attach to the jira)



We tried all sorts of restarts, running zkCleanup, etc. We even completely
shut down our clients... and the pathology continued. Our workaround was to
do an urgent upgrade to version 3.3.0. The new ZK cluster with 3.3.0 has
been running well for us... so far...



Off hand and with the data we have so far nothing sticks out that 3.3 
would have resolved (JIRA is conveniently down for the last hour or so 
so I can't review right now). Although there were some changes to reduce 
memory consumption (see below).



I realize that, sadly, this message doesn't contain nearly enough details to
trace exactly what happened. I guess I'm wondering if anyone has seen this
general scenario, and/or knows how to prevent it? Is there anything we might be
doing client side to trigger this? Our application-level request frequency
is maybe a few requests to Zookeeper per second, times 5 client
applications. If we detect a SESSION EXPIRED, we do a simple "create new
client and use that instead". And we were seeing this happen occasionally.
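
The "create new client on expiry" recovery described above can be sketched library-agnostically (Python; `make_client` and the event names are placeholders for whatever ZooKeeper binding is in use, and re-creating ephemerals/watches after a rebuild is left to the caller):

```python
class SessionGuard:
    """Recreate-on-expiry pattern: a ZooKeeper session expiry is fatal to the
    client handle, so the only recovery is constructing a fresh client."""

    def __init__(self, make_client):
        self._make_client = make_client  # placeholder factory: () -> client
        self.client = make_client()
        self.rebuilds = 0

    def on_event(self, state):
        # "Disconnected" is transient -- the client library reconnects on its
        # own, so the existing handle is kept. Only "Expired" requires a
        # brand-new client.
        if state == "Expired":
            self.client = self._make_client()
            self.rebuilds += 1
        return self.client
```

As Patrick's follow-up questions suggest, frequent expirations are usually a symptom (GC pauses, swapping) rather than something to paper over client side.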



What are the clients doing? Do you have a large number/size of znodes?

Do you see any OutOfMemoryError in the logs?

Could the ZK server java process be swapping? Are you monitoring GC, 
perhaps large GC pauses are happening?


I have a suspicion that one of a few things might be happening. I see 
the following in your original email:


> :followerhand...@302] - Sending snapshot last zxid of peer is 0xd0007d66d
> zxid of leader is 0xf
> 2010-04-24 23:06:03,254 - ERROR [FollowerHandler-/10.0.10.116:34405
> :followerhand...@415] - Unexpected exception causing shutdown while sock
> still open
> java.net.SocketException: Broken pipe
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at

 this log section is saying that a ZK server (follower) connected to 
the elected leader and started to download (leader is sending here) the 
latest ZK snapshot (db) to the follower while it is recovering. However 
before the follower fully downloads the snapshot the connection (pipe) 
is broken and this operation fails. At this point the leader itself will 
probably give up the lead, a new election will happen, and it might be 
the case that this same problem occurs - the follower tries to download 
the snapshot but it again fails (and this loops forever)


This issue can happen if the servers are slowed, either by slow 
interconnects (wan), network connectivity/perf problems, swapping of the 
JVM, and most commonly GC in the VM.


See this recent case:
http://www.mail-archive.com/zookeeper-...@hadoop.apache.org/msg08083.html

What is the size of your snapshot file? (you can see this in your 
zookeeper data directory, unless you wiped this during your attempt to 
cleanup/recover/upgrade?).


Seeing your config file, and getting a sense of the snapshot file size 
will give us more insight.


Patrick


Many, many thanks in advance for any insights or advice.


Example log output from the ZK server that spewed many WARNs:

===

2010-04-24 22:31:24,551 - WARN
  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:follo...@318] - Exception when following
the leader
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:129)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
at java.io.DataInputStream.readInt(DataInputStream.java:370)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:66)
at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
at org.apache.zookeeper.server.quorum.Follower.readPacket(Follower.java:114)
at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:193)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(Quo