Re: zookeeper watch triggered multiple times on same event
I don't think this should be possible (if it happens it's a bug in zk). Perhaps, for some reason, there really are 2 change actions (children created, or the same child created twice) and not just one? Re-registering the watch "inside the watch" is fine. The server sends watch notifications as one way messages, when it notices a znode child list has changed it fires off change messages to all the registered clients. The client then receives the notification and calls the handler. Patrick On 06/15/2010 05:47 PM, Jun Rao wrote: Hi, I get a quick question on ZK 3.2.2. Here is a sequence of events during a test: 1. client 1 creates an ephemeral node under /a 2. client 1 sets a watch using getChildren on /a 3. client 2 creates an ephemeral node under /a 4. client 1's watch gets triggered (a node change event). Inside the watch, client 1 does getChildren on /a and sets the watch. 5. client 1's watch gets triggered again (a node change event) My question is why the same node change event gets triggered twice. It seems that step 5 shouldn't have happened. Thanks, Jun
zookeeper watch triggered multiple times on same event
Hi, I get a quick question on ZK 3.2.2. Here is a sequence of events during a test: 1. client 1 creates an ephemeral node under /a 2. client 1 sets a watch using getChildren on /a 3. client 2 creates an ephemeral node under /a 4. client 1's watch gets triggered (a node change event). Inside the watch, client 1 does getChildren on /a and sets the watch. 5. client 1's watch gets triggered again (a node change event) My question is why the same node change event gets triggered twice. It seems that step 5 shouldn't have happened. Thanks, Jun
Re: Debugging help for SessionExpiredException
I'm not very experienced personally with running zk on ec2 smalls, Ted usually has the ec2 related insight. Given these boxes are not loaded or lightly loaded, and you've ruled out gc/swap, the only thing I can think of is that something is going on under the covers at the vm level that's causing the high latency you're seeing. You're seeing 15 _minutes_ max latency. I can't think of what would cause that inside zk. Any chance that the VM is shutting down or "freezing" during that period? I dont' know. Are you monitoring that system from a second system? Perhaps that might shed some light (monitor the cpu/disk activity using some monitoring tool like ganglia, nagios, etc... or even more primitive, perhaps doing a ping to that system and tracking the round trip time/packet loss, dump to a file and review the next day, etc...) Patrick On 06/15/2010 03:59 PM, Jordan Zimmerman wrote: They're small instances. The thing is that these machines are doing next to no work. We're just running simple little tests. The session expiration has not happened while I've been watching. It tends to happen over night. -JZ On Jun 15, 2010, at 1:50 PM, Ted Dunning wrote: As usual, the ZK team provides the best feedback. I would be bold enough to ask what kind of ec2 instances you are running on. Small instances are small chunks of larger machines and are sometimes subject to competition for resources from the other tenants. On Tue, Jun 15, 2010 at 12:30 PM, Patrick Hunt wrote: 3) under-provisioned virtual machines (ie vmware) ... Given that you've ruled out the gc (most common), disk utilization would be the next thing to check.
Re: Debugging help for SessionExpiredException
They're small instances. The thing is that these machines are doing next to no work. We're just running simple little tests. The session expiration has not happened while I've been watching. It tends to happen over night. -JZ On Jun 15, 2010, at 1:50 PM, Ted Dunning wrote: > As usual, the ZK team provides the best feedback. > > I would be bold enough to ask what kind of ec2 instances you are running on. > Small instances are small chunks of larger machines and are sometimes subject > to competition for resources from the other tenants. > > On Tue, Jun 15, 2010 at 12:30 PM, Patrick Hunt wrote: > 3) under-provisioned virtual machines (ie vmware) > > ... > > Given that you've ruled out the gc (most common), disk utilization would be > the next thing to check. >
Re: Debugging help for SessionExpiredException
As usual, the ZK team provides the best feedback. I would be bold enough to ask what kind of ec2 instances you are running on. Small instances are small chunks of larger machines and are sometimes subject to competition for resources from the other tenants. On Tue, Jun 15, 2010 at 12:30 PM, Patrick Hunt wrote: > 3) under-provisioned virtual machines (ie vmware) > > ... > > Given that you've ruled out the gc (most common), disk utilization would be > the next thing to check. >
Re: Debugging help for SessionExpiredException
Yes, 965 seconds is huge. The times I've seen such huge latencies are (in order of frequency seen): 1) when the java process gc's, swaps, or both and/or 2) disk utilization on the ZK server is high and/or 3) under-provisioned virtual machines (ie vmware) Re 2) in some cases we've seen users running multiple IO heavy processes on the same host as the zk server. hdfs data nodes and/or hbase region servers are two common examples. In those cases I've monitored "sudo iostat -x 3", specifically the "%util", and when this case is occurring I've seen heavy utilization concurrent with the max latency spiking up. ZK servers need to sync a file to disk on every write operation, maxed out disk utilization (esp on ext3 fs) will cause a spike in max latency as the server has to wait for the sync to complete. I notice you have ~1mil writes and ~1mil reads according to the current server stats. It seems you are doing 1 read for every write (so relatively low read to write ratio... not extremely so but something to consider) You might reset the latency using the "srst" 4 letter word http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands then monitor the max latency, the GC history, the "swappyness" of the java process, and esp the disk utilization. You can correlate a jump in latency against these (gc, swap, %util, etc...). useful tool: http://github.com/phunt/zktop Given that you've ruled out the gc (most common), disk utilization would be the next thing to check. Patrick On 06/15/2010 11:20 AM, Jordan Zimmerman wrote: Yes - the session drop happened again. I did the stat. The max latency is huge (I assume that's in ms). Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT Clients: /10.243.14.179:57300[1](queued=0,recved=0,sent=0) /207.111.236.2:51493[1](queued=0,recved=1,sent=0) /10.243.13.191:44414[1](queued=0,recved=0,sent=0) /10.243.13.191:44415[1](queued=0,recved=0,sent=0) /10.244.135.100:58965[1](queued=0,recved=0,sent=0) /10.243.15.192:45057[1](queued=0,recved=0,sent=0) Latency min/avg/max: 0/6/965759 Received: 2203048 Sent: 2202979 Outstanding: 0 Zxid: 0xef2b2 Mode: standalone Node count: 347 On Jun 15, 2010, at 11:02 AM, Ted Dunning wrote: Jordan, Good step to get this info. I have to ask, did you have your disconnect problem last night as well? (just checking) What does the stat command on ZK give you for each server? On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman< jzimmer...@proofpoint.com> wrote: More on this... I ran last night with verbose GC on our client. I analyzed the GC log in gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the problem. NOTE we're running on Amazon EC2.
Re: Debugging help for SessionExpiredException
Yes - the session drop happened again. I did the stat. The max latency is huge (I assume that's in ms). Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT Clients: /10.243.14.179:57300[1](queued=0,recved=0,sent=0) /207.111.236.2:51493[1](queued=0,recved=1,sent=0) /10.243.13.191:44414[1](queued=0,recved=0,sent=0) /10.243.13.191:44415[1](queued=0,recved=0,sent=0) /10.244.135.100:58965[1](queued=0,recved=0,sent=0) /10.243.15.192:45057[1](queued=0,recved=0,sent=0) Latency min/avg/max: 0/6/965759 Received: 2203048 Sent: 2202979 Outstanding: 0 Zxid: 0xef2b2 Mode: standalone Node count: 347 On Jun 15, 2010, at 11:02 AM, Ted Dunning wrote: > Jordan, > > Good step to get this info. > > I have to ask, did you have your disconnect problem last night as well? > (just checking) > > What does the stat command on ZK give you for each server? > > On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman < > jzimmer...@proofpoint.com> wrote: > >> More on this... >> >> I ran last night with verbose GC on our client. I analyzed the GC log in >> gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the >> Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the >> problem. NOTE we're running on Amazon EC2. >> >>
Re: Debugging help for SessionExpiredException
Jordan, Good step to get this info. I have to ask, did you have your disconnect problem last night as well? (just checking) What does the stat command on ZK give you for each server? On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman < jzimmer...@proofpoint.com> wrote: > More on this... > > I ran last night with verbose GC on our client. I analyzed the GC log in > gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the > Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the > problem. NOTE we're running on Amazon EC2. > >
Re: Debugging help for SessionExpiredException
More on this... I ran last night with verbose GC on our client. I analyzed the GC log in gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the problem. NOTE we're running on Amazon EC2. -JZ On Jun 9, 2010, at 11:36 AM, Jordan Zimmerman wrote: > We have a test system using Zookeeper. There is a single Zookeeper server > node and 4 clients. There is very little activity in this system. After a > day's testing we start to see SessionExpiredException on the client. Things > I've tried: > > * Increasing the session timeout to 1 minute > * Making sure all JVMs are running in a 100MB partition > > Any help debugging this problem would be appreciated. What kind of > diagnostics should can I add? Are there more config parameters that I should > try? > > -JZ