Re: Debugging help for SessionExpiredException

2010-06-16 Thread Eric Bowman
Setting up a little process to run overnight that appends a timestamp to a file once per second or so can be a very effective tool for ruling out, for example, "extra-dimensional" VM effects. On 06/16/2010 12:15 AM, Patrick Hunt wrote: > I'm not very experienced personally with running zk on ec2 s
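A minimal sketch of the overnight timestamp process described above, assuming Python 3.7+ is available on the box. The function names, the log path, and the 5-second gap threshold are illustrative choices, not anything from the thread. The idea is that if the whole VM stalls, the file shows a gap between consecutive timestamps that cannot be blamed on the JVM.

```python
import time
from datetime import datetime, timezone


def append_timestamps(path, interval=1.0, count=None):
    """Append one ISO-8601 UTC timestamp per line, roughly every `interval` seconds.

    Runs forever when `count` is None; `count` exists only to make testing easy.
    """
    written = 0
    while count is None or written < count:
        # Re-open on every write so each line reaches the file promptly.
        with open(path, "a") as f:
            f.write(datetime.now(timezone.utc).isoformat() + "\n")
        written += 1
        time.sleep(interval)


def find_gaps(path, threshold=5.0):
    """Return (previous, current, gap_seconds) for consecutive timestamps
    that are more than `threshold` seconds apart."""
    gaps = []
    prev = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            cur = datetime.fromisoformat(line)
            if prev is not None:
                delta = (cur - prev).total_seconds()
                if delta > threshold:
                    gaps.append((prev, cur, delta))
            prev = cur
    return gaps
```

Start `append_timestamps("/tmp/heartbeat.log")` before leaving for the night, then run `find_gaps("/tmp/heartbeat.log")` in the morning and compare any gaps against the times of the session expirations. This relies on the guest's wall clock, so it assumes the clock catches up after a stall.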

Re: Debugging help for SessionExpiredException

2010-06-15 Thread Patrick Hunt
I'm not very experienced personally with running zk on ec2 smalls, Ted usually has the ec2 related insight. Given these boxes are not loaded or lightly loaded, and you've ruled out gc/swap, the only thing I can think of is that something is going on under the covers at the vm level that's causi

Re: Debugging help for SessionExpiredException

2010-06-15 Thread Jordan Zimmerman
They're small instances. The thing is that these machines are doing next to no work. We're just running simple little tests. The session expiration has not happened while I've been watching. It tends to happen overnight. -JZ On Jun 15, 2010, at 1:50 PM, Ted Dunning wrote: > As usual, the ZK t

Re: Debugging help for SessionExpiredException

2010-06-15 Thread Ted Dunning
As usual, the ZK team provides the best feedback. I would be bold enough to ask what kind of ec2 instances you are running on. Small instances are small chunks of larger machines and are sometimes subject to competition for resources from the other tenants. On Tue, Jun 15, 2010 at 12:30 PM, Patr

Re: Debugging help for SessionExpiredException

2010-06-15 Thread Patrick Hunt
Yes, 965 seconds is huge. The times I've seen such huge latencies are (in order of frequency seen): 1) when the java process gc's, swaps, or both and/or 2) disk utilization on the ZK server is high and/or 3) under-provisioned virtual machines (ie vmware) Re 2) in some cases we've seen users

Re: Debugging help for SessionExpiredException

2010-06-15 Thread Jordan Zimmerman
Yes - the session drop happened again. I did the stat. The max latency is huge (I assume that's in ms). Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT Clients: /10.243.14.179:57300[1](queued=0,recved=0,sent=0) /207.111.236.2:51493[1](queued=0,recved=1,sent=0) /10.243.13.191:444
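A rough sketch of collecting the same `stat` output from each server on a schedule, so that a latency spike can be lined up with the time of a session drop. It assumes the server answers the `stat` four-letter word on the default client port 2181 and prints a line of the form `Latency min/avg/max: a/b/c`. The helper names are made up for illustration, and the line format may differ between ZooKeeper versions.

```python
import re
import socket


def four_letter_word(host, port=2181, cmd=b"stat", timeout=5.0):
    """Send a four-letter command to a ZooKeeper server and return its reply as text."""
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(cmd)
        chunks = []
        while True:
            data = s.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", "replace")


# Assumed line format; the average is parsed as a float in case it is fractional.
_LATENCY = re.compile(r"^Latency min/avg/max:\s*(\d+)/([\d.]+)/(\d+)", re.MULTILINE)


def parse_latency(stat_text):
    """Return (min, avg, max) from `stat` output, or None if the line is missing."""
    m = _LATENCY.search(stat_text)
    if m is None:
        return None
    return int(m.group(1)), float(m.group(2)), int(m.group(3))
```

For example, `parse_latency(four_letter_word("localhost"))` could be logged once a minute from cron; the host name here is a placeholder.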

Re: Debugging help for SessionExpiredException

2010-06-15 Thread Ted Dunning
Jordan, Good step to get this info. I have to ask, did you have your disconnect problem last night as well? (just checking) What does the stat command on ZK give you for each server? On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman < jzimmer...@proofpoint.com> wrote: > More on this... > > I

Re: Debugging help for SessionExpiredException

2010-06-15 Thread Jordan Zimmerman
More on this... I ran last night with verbose GC on our client. I analyzed the GC log in gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the problem. NOTE we're running on Amazon EC2. -JZ On Ju

Re: Debugging help for SessionExpiredException

2010-06-11 Thread Patrick Hunt
Session expiration is due to the server not hearing heartbeats from the client. So either the client is partitioned from the server, or the client is not sending heartbeats for some reason; typically this is due to the client JVM gc'ing or swapping. Patrick On 06/10/2010 04:14 PM, Ted Dunning
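To make the heartbeat rule concrete, here is a deliberately simplified model, not ZooKeeper's actual expiry code: it assumes the server expires a session once no heartbeat has arrived within the session timeout of the previous one. The function name and the numbers in the usage note are hypothetical.

```python
def first_expiry(heartbeat_times, session_timeout):
    """Return the time at which the session would expire, or None if it never does.

    `heartbeat_times` are the times (in seconds) at which the server heard
    from the client; `session_timeout` is in the same unit.
    """
    last = None
    for t in sorted(heartbeat_times):
        if last is not None and t - last > session_timeout:
            # The server gave up before this heartbeat arrived.
            return last + session_timeout
        last = t
    return None
```

With a 30-second timeout, `first_expiry([0, 10, 20, 65, 75], 30)` returns 50: the client went silent after t=20 (a GC pause, swapping, a network partition, or a stalled VM all look the same from the server's side), so the session was already gone when the heartbeat at t=65 arrived.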

Re: Debugging help for SessionExpiredException

2010-06-10 Thread Jordan Zimmerman
On Jun 9, 2010, at 4:21 PM, Patrick Hunt wrote: > In particular you might look at GC/swapping on your clients, that's the most > common case we see for session expiration (apart from the obvious -- network > level connectivity failures). In one case I remember there was very heavy > network lo

Re: Debugging help for SessionExpiredException

2010-06-10 Thread Ted Dunning
Uh the options I was recommending were for your CLIENT. You should have similar settings on ZK, but it is your client that is likely to be pausing. On Thu, Jun 10, 2010 at 4:08 PM, Jordan Zimmerman wrote: > The thing is, this is a test instance (on AWS/EC2) that isn't getting a lot > of tra

Re: Debugging help for SessionExpiredException

2010-06-10 Thread Jordan Zimmerman
The thing is, this is a test instance (on AWS/EC2) that isn't getting a lot of traffic. i.e. 1 zookeeper instance that we're testing with. On Jun 10, 2010, at 4:06 PM, Ted Dunning wrote: > Possibly. > > I have seen GC times of > 4 minutes on some large processes. Better to set > the GC paramet

Re: Debugging help for SessionExpiredException

2010-06-10 Thread Ted Dunning
Possibly. I have seen GC times of > 4 minutes on some large processes. Better to set the GC parameters so you don't get long pauses. On http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting it mentions using the "-XX:ParallelGCThreads=8 -XX:+UseConcMarkSweepGC" options. I recommend adding

Re: Debugging help for SessionExpiredException

2010-06-09 Thread Patrick Hunt
On 06/09/2010 03:35 PM, Lei Zhang wrote: We've consistently run into issues with VMware Workstation (CentOS as guest OS) on a Windows host: just leaving the cluster idle overnight leads to a zk session expiration. My theory is: Windows may have gone into hibernation, the zk heartbeat logic hibe

Re: Debugging help for SessionExpiredException

2010-06-09 Thread Ted Dunning
This can depend on which kind of instance you invoke as well. The smallest instances disappear for short periods of time and that can lead to surprises. On Wed, Jun 9, 2010 at 3:35 PM, Lei Zhang wrote: > On EC2 (still CentOS as guest OS), we consistently run into zk session > expire issue when

Re: Debugging help for SessionExpiredException

2010-06-09 Thread Lei Zhang
We use ZooKeeper in virtualized environments, both on Amazon EC2 and on VMware Workstation on local machines. We've consistently run into issues with VMware Workstation (CentOS as guest OS) on a Windows host: just leaving the cluster idle overnight leads to a zk session expiration. My theory is:

Re: Debugging help for SessionExpiredException

2010-06-09 Thread Stephen Green
On Wed, Jun 9, 2010 at 2:47 PM, Patrick Hunt wrote: > My guess is that your client is gcing for long periods of time - you can > rule this in/out by turning on gc logging in your clients and then viewing > the results after another such incident happens (try gchisto for graphical > view) From re

Re: Debugging help for SessionExpiredException

2010-06-09 Thread Patrick Hunt
"100mb partition"? sounds like virtualization. resource starvation (worse in virtualized env) is a common cause of this. Are your clients gcing/swapping at all? If a client gc's for long periods of time the heartbeat thread won't be able to run and the server will expire the session. There is a