Re: zookeeper watch triggered multiple times on same event

2010-06-15 Thread Patrick Hunt
I don't think this should be possible (if it happens it's a bug in zk). 
Perhaps, for some reason, there really are 2 change actions (children 
created, or the same child created twice) and not just one?


Re-registering the watch "inside the watch" is fine. The server sends 
watch notifications as one way messages, when it notices a znode child 
list has changed it fires off change messages to all the registered 
clients. The client then receives the notification and calls the handler.


Patrick

On 06/15/2010 05:47 PM, Jun Rao wrote:

Hi,

I get a quick question on ZK 3.2.2. Here is a sequence of events during a
test:
1. client 1 creates an ephemeral node under /a
2. client 1 sets a watch using getChildren on /a
3. client 2 creates an ephemeral node under /a
4. client 1's watch gets triggered (a node change event). Inside the watch,
client 1 does getChildren on /a and sets the watch.
5. client 1's watch gets triggered again (a node change event)

My question is why the same node change event gets triggered twice. It seems
that step 5 shouldn't have happened.

Thanks,

Jun



zookeeper watch triggered multiple times on same event

2010-06-15 Thread Jun Rao
Hi,

I get a quick question on ZK 3.2.2. Here is a sequence of events during a
test:
1. client 1 creates an ephemeral node under /a
2. client 1 sets a watch using getChildren on /a
3. client 2 creates an ephemeral node under /a
4. client 1's watch gets triggered (a node change event). Inside the watch,
client 1 does getChildren on /a and sets the watch.
5. client 1's watch gets triggered again (a node change event)

My question is why the same node change event gets triggered twice. It seems
that step 5 shouldn't have happened.

Thanks,

Jun


Re: Debugging help for SessionExpiredException

2010-06-15 Thread Patrick Hunt
I'm not very experienced personally with running zk on ec2 smalls, Ted 
usually has the ec2 related insight. Given these boxes are not loaded or 
lightly loaded, and you've ruled out gc/swap, the only thing I can think 
of is that something is going on under the covers at the vm level that's 
causing the high latency you're seeing.


You're seeing 15 _minutes_ max latency. I can't think of what would 
cause that inside zk. Any chance that the VM is shutting down or 
"freezing" during that period? I dont' know. Are you monitoring that 
system from a second system? Perhaps that might shed some light (monitor 
the cpu/disk activity using some monitoring tool like ganglia, nagios, 
etc... or even more primitive, perhaps doing a ping to that system and 
tracking the round trip time/packet loss, dump to a file and review the 
next day, etc...)


Patrick

On 06/15/2010 03:59 PM, Jordan Zimmerman wrote:

They're small instances. The thing is that these machines are doing
next to no work. We're just running simple little tests. The session
expiration has not happened while I've been watching. It tends to
happen over night.

-JZ

On Jun 15, 2010, at 1:50 PM, Ted Dunning wrote:


As usual, the ZK team provides the best feedback.

I would be bold enough to ask what kind of ec2 instances you are
running on.  Small instances are small chunks of larger machines
and are sometimes subject to competition for resources from the
other tenants.

On Tue, Jun 15, 2010 at 12:30 PM, Patrick Hunt
wrote: 3) under-provisioned virtual machines (ie vmware)

...

Given that you've ruled out the gc (most common), disk utilization
would be the next thing to check.






Re: Debugging help for SessionExpiredException

2010-06-15 Thread Jordan Zimmerman
They're small instances. The thing is that these machines are doing next to no 
work. We're just running simple little tests. The session expiration has not 
happened while I've been watching. It tends to happen over night.

-JZ

On Jun 15, 2010, at 1:50 PM, Ted Dunning wrote:

> As usual, the ZK team provides the best feedback.
> 
> I would be bold enough to ask what kind of ec2 instances you are running on.  
> Small instances are small chunks of larger machines and are sometimes subject 
> to competition for resources from the other tenants.
> 
> On Tue, Jun 15, 2010 at 12:30 PM, Patrick Hunt  wrote:
> 3) under-provisioned virtual machines (ie vmware)
> 
> ...
> 
> Given that you've ruled out the gc (most common), disk utilization would be 
> the next thing to check.
> 



Re: Debugging help for SessionExpiredException

2010-06-15 Thread Ted Dunning
As usual, the ZK team provides the best feedback.

I would be bold enough to ask what kind of ec2 instances you are running on.
 Small instances are small chunks of larger machines and are sometimes
subject to competition for resources from the other tenants.

On Tue, Jun 15, 2010 at 12:30 PM, Patrick Hunt  wrote:

> 3) under-provisioned virtual machines (ie vmware)
>
> ...
>
> Given that you've ruled out the gc (most common), disk utilization would be
> the next thing to check.
>


Re: Debugging help for SessionExpiredException

2010-06-15 Thread Patrick Hunt

Yes, 965 seconds is huge.

The times I've seen such huge latencies are (in order of frequency seen):

1) when the java process gc's, swaps, or both

and/or

2) disk utilization on the ZK server is high

and/or

3) under-provisioned virtual machines (ie vmware)

Re 2) in some cases we've seen users running multiple IO heavy processes 
on the same host as the zk server. hdfs data nodes and/or hbase region 
servers are two common examples. In those cases I've monitored "sudo 
iostat -x 3", specifically the "%util", and when this case is occurring 
I've seen heavy utilization concurrent with the max latency spiking up. 
ZK servers need to sync a file to disk on every write operation, maxed 
out disk utilization (esp on ext3 fs) will cause a spike in max latency 
as the server has to wait for the sync to complete.


I notice you have ~1mil writes and ~1mil reads according to the current 
server stats. It seems you are doing 1 read for every write (so 
relatively low read to write ratio... not extremely so but something to 
consider)


You might reset the latency using the "srst" 4 letter word
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_zkCommands
then monitor the max latency, the GC history, the "swappyness" of the 
java process, and esp the disk utilization. You can correlate a jump in 
latency against these (gc, swap, %util, etc...).


useful tool: http://github.com/phunt/zktop

Given that you've ruled out the gc (most common), disk utilization would 
be the next thing to check.


Patrick

On 06/15/2010 11:20 AM, Jordan Zimmerman wrote:

Yes - the session drop happened again. I did the stat. The max latency is huge 
(I assume that's in ms).

Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT
Clients:
  /10.243.14.179:57300[1](queued=0,recved=0,sent=0)
  /207.111.236.2:51493[1](queued=0,recved=1,sent=0)
  /10.243.13.191:44414[1](queued=0,recved=0,sent=0)
  /10.243.13.191:44415[1](queued=0,recved=0,sent=0)
  /10.244.135.100:58965[1](queued=0,recved=0,sent=0)
  /10.243.15.192:45057[1](queued=0,recved=0,sent=0)

Latency min/avg/max: 0/6/965759
Received: 2203048
Sent: 2202979
Outstanding: 0
Zxid: 0xef2b2
Mode: standalone
Node count: 347


On Jun 15, 2010, at 11:02 AM, Ted Dunning wrote:


Jordan,

Good step to get this info.

I have to ask, did you have your disconnect problem last night as well?
(just checking)

What does the stat command on ZK give you for each server?

On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman<
jzimmer...@proofpoint.com>  wrote:


More on this...

I ran last night with verbose GC on our client. I analyzed the GC log in
gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the
Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the
problem. NOTE we're running on Amazon EC2.






Re: Debugging help for SessionExpiredException

2010-06-15 Thread Jordan Zimmerman
Yes - the session drop happened again. I did the stat. The max latency is huge 
(I assume that's in ms).

Zookeeper version: 3.3.0-925362, built on 03/19/2010 18:38 GMT
Clients:
 /10.243.14.179:57300[1](queued=0,recved=0,sent=0)
 /207.111.236.2:51493[1](queued=0,recved=1,sent=0)
 /10.243.13.191:44414[1](queued=0,recved=0,sent=0)
 /10.243.13.191:44415[1](queued=0,recved=0,sent=0)
 /10.244.135.100:58965[1](queued=0,recved=0,sent=0)
 /10.243.15.192:45057[1](queued=0,recved=0,sent=0)

Latency min/avg/max: 0/6/965759
Received: 2203048
Sent: 2202979
Outstanding: 0
Zxid: 0xef2b2
Mode: standalone
Node count: 347


On Jun 15, 2010, at 11:02 AM, Ted Dunning wrote:

> Jordan,
> 
> Good step to get this info.
> 
> I have to ask, did you have your disconnect problem last night as well?
> (just checking)
> 
> What does the stat command on ZK give you for each server?
> 
> On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman <
> jzimmer...@proofpoint.com> wrote:
> 
>> More on this...
>> 
>> I ran last night with verbose GC on our client. I analyzed the GC log in
>> gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the
>> Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the
>> problem. NOTE we're running on Amazon EC2.
>> 
>> 



Re: Debugging help for SessionExpiredException

2010-06-15 Thread Ted Dunning
Jordan,

Good step to get this info.

I have to ask, did you have your disconnect problem last night as well?
 (just checking)

What does the stat command on ZK give you for each server?

On Tue, Jun 15, 2010 at 10:33 AM, Jordan Zimmerman <
jzimmer...@proofpoint.com> wrote:

> More on this...
>
> I ran last night with verbose GC on our client. I analyzed the GC log in
> gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the
> Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the
> problem. NOTE we're running on Amazon EC2.
>
>


Re: Debugging help for SessionExpiredException

2010-06-15 Thread Jordan Zimmerman
More on this...

I ran last night with verbose GC on our client. I analyzed the GC log in 
gchisto and 99% of the GCs are 1 or 2 ms. The longest gc is 30 ms. On the 
Zookeeper server side, the longest gc is 130 ms. So, I submit, GC is not the 
problem. NOTE we're running on Amazon EC2. 

-JZ

On Jun 9, 2010, at 11:36 AM, Jordan Zimmerman wrote:

> We have a test system using Zookeeper. There is a single Zookeeper server 
> node and 4 clients. There is very little activity in this system. After a 
> day's testing we start to see SessionExpiredException on the client. Things 
> I've tried:
> 
> * Increasing the session timeout to 1 minute
> * Making sure all JVMs are running in a 100MB partition
> 
> Any help debugging this problem would be appreciated. What kind of 
> diagnostics should can I add? Are there more config parameters that I should 
> try?
> 
> -JZ