Re: Debugging help for SessionExpiredException

2010-06-09 Thread Patrick Hunt


On 06/09/2010 03:35 PM, Lei Zhang wrote:


We've consistently run into issues with VMware Workstation (CentOS as guest
OS) on a Windows host: just leaving the cluster idle overnight leads to zk
session expiration. My theory is: Windows may have gone into hibernation,
the zk heartbeat logic hibernates with it, and the session expired exception
is thrown the moment Windows is taken out of hibernation.



That sounds like a possible scenario.
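
One way to check a theory like this is to run a small watchdog next to the
client and log any gap in which the process did not get to run at all
(host hibernation, a long GC, heavy swapping). The following is only a
sketch under stated assumptions: the class name StallDetector, the 100 ms
polling interval and the 1 s reporting threshold are arbitrary choices for
illustration, nothing here is ZooKeeper API, and how the guest's wall clock
behaves across a host hibernation can vary by hypervisor setup.

```java
// Hypothetical helper for illustration; not part of ZooKeeper.
final class StallDetector {

    /** Time (ms) that elapsed beyond the requested sleep; never negative. */
    static long stallMillis(long requestedSleepMs, long startMs, long endMs) {
        return Math.max(0L, (endMs - startMs) - requestedSleepMs);
    }

    public static void main(String[] args) throws InterruptedException {
        final long sleepMs = 100;      // polling interval (arbitrary)
        final long thresholdMs = 1000; // only report gaps over 1 s (arbitrary)

        // Bounded so the sketch terminates; a real watchdog would loop forever.
        for (int i = 0; i < 50; i++) {
            long start = System.currentTimeMillis();
            Thread.sleep(sleepMs);
            long stall = stallMillis(sleepMs, start, System.currentTimeMillis());
            if (stall > thresholdMs) {
                System.out.println("process did not run for ~" + stall + " ms");
            }
        }
    }
}
```

If the reported gaps line up with the session expirations and are longer
than the session timeout, that points at the client being frozen rather
than at the network.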


On EC2 (still CentOS as guest OS), we consistently run into zk session
expiration when our cluster is under heavy load. I am planning to raise
the scheduling priority of the zk server, but haven't tested this yet.



Before you take any action you might examine a few things to identify 
what's biting you:


This has some good general detail on issues other users have seen:
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

In particular you might look at GC/swapping on your clients; that's the
most common case we see for session expiration (apart from the obvious
one, network-level connectivity failures). In one case I remember there
was very heavy network load for a period of time once per day. This was
causing some issue on the switches which would result in occasional
session expiration, but only during this short window. This was pretty
hard to track down. Are you monitoring network connectivity in general?
Is it possible that temporary network outages are causing this? Perhaps
take a look at both your server and client ZK logs and see if the client
is seeing anything other than the session expiration. For example, is
the client seeing session TIMED OUT? That happens when the client
doesn't hear back from the server, while session expiration happens
because the server doesn't hear from the client.


Good luck,

Patrick


Re: Debugging help for SessionExpiredException

2010-06-09 Thread Ted Dunning
This can depend on which kind of instance you invoke as well.  The smallest
instances disappear for short periods of time and that can lead to
surprises.

On Wed, Jun 9, 2010 at 3:35 PM, Lei Zhang wrote:

> On EC2 (still CentOS as guest OS), we consistently run into zk session
> expiration when our cluster is under heavy load. I am planning to raise
> the scheduling priority of the zk server, but haven't tested this yet.
>


Re: Debugging help for SessionExpiredException

2010-06-09 Thread Lei Zhang
We use ZooKeeper in virtualized environments, both on Amazon EC2 and on
VMware Workstation on local machines.

We've consistently run into issues with VMware Workstation (CentOS as guest
OS) on a Windows host: just leaving the cluster idle overnight leads to zk
session expiration. My theory is: Windows may have gone into hibernation,
the zk heartbeat logic hibernates with it, and the session expired exception
is thrown the moment Windows is taken out of hibernation.

On EC2 (still CentOS as guest OS), we consistently run into zk session
expiration when our cluster is under heavy load. I am planning to raise
the scheduling priority of the zk server, but haven't tested this yet.


Re: Debugging help for SessionExpiredException

2010-06-09 Thread Stephen Green
On Wed, Jun 9, 2010 at 2:47 PM, Patrick Hunt wrote:
> My guess is that your client is gcing for long periods of time - you can
> rule this in/out by turning on gc logging in your clients and then viewing
> the results after another such incident happens (try gchisto for graphical
> view)

From recent experience, the incantation:

-verbose:gc -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

at JVM startup time will give you very good information about what the
garbage collector is doing.
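
If restarting the clients with extra flags is inconvenient, a rougher
picture is also available from inside the process through the standard
java.lang.management beans. This is a minimal sketch, not a replacement
for the GC log: the class name GcSnapshot is made up for illustration, and
the beans report cumulative counts and times per collector, not the length
of individual pauses.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration; uses only the standard JDK API.
final class GcSnapshot {

    /** Sums accumulated collection time (ms) over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        System.out.println("total GC ms: " + totalGcMillis());
    }
}
```

Logging totalGcMillis() periodically and comparing consecutive values
gives a coarse idea of how much time went to GC in each interval.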

Steve
-- 
Stephen Green
http://thesearchguy.wordpress.com


Re: Debugging help for SessionExpiredException

2010-06-09 Thread Patrick Hunt
"100mb partition"? sounds like virtualization. resource starvation 
(worse in a virtualized env) is a common cause of this. Are your clients
gcing/swapping at all? If a client gc's for long periods of time the
heartbeat thread won't be able to run and the server will expire the
session. There is a min/max cap that the server places on the client
timeouts (it's negotiated); check the client log for detail on what
timeout it negotiated (logged in 3.3 releases).
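
To illustrate the cap: as far as I know, the server bounds the requested
timeout between minSessionTimeout and maxSessionTimeout, which default to
2x and 20x tickTime. The sketch below assumes those defaults and a
tickTime of 2000 ms; the class and method names are hypothetical and not
part of the ZooKeeper API, so check your own server config and client log
for the real values.

```java
// Hypothetical helper for illustration; not ZooKeeper code.
final class SessionTimeoutSketch {

    /** Timeout (ms) a server would grant, assuming the default min/max caps. */
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;   // assumed default minSessionTimeout
        int max = 20 * tickTimeMs;  // assumed default maxSessionTimeout
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000; // assumption; use the tickTime from your zoo.cfg
        // A client asking for 1 minute is capped at 20 * tickTime.
        System.out.println(negotiate(60000, tickTimeMs)); // prints 40000
    }
}
```

Under these assumptions a requested 1 minute timeout would be granted as
40 seconds, which is why the negotiated value in the client log is worth
checking.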


Take a look at this and see if you can make progress:
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

My guess is that your client is gcing for long periods of time. You can
rule this in/out by turning on gc logging in your clients and then
viewing the results after another such incident happens (try gchisto for
a graphical view).


Patrick

On 06/09/2010 11:36 AM, Jordan Zimmerman wrote:

We have a test system using Zookeeper. There is a single Zookeeper
server node and 4 clients. There is very little activity in this
system. After a day's testing we start to see SessionExpiredException
on the client. Things I've tried:

* Increasing the session timeout to 1 minute
* Making sure all JVMs are running in a 100MB partition

Any help debugging this problem would be appreciated. What kind of
diagnostics can I add? Are there more config parameters that I
should try?

-JZ


Debugging help for SessionExpiredException

2010-06-09 Thread Jordan Zimmerman
We have a test system using Zookeeper. There is a single Zookeeper server node 
and 4 clients. There is very little activity in this system. After a day's 
testing we start to see SessionExpiredException on the client. Things I've 
tried:

* Increasing the session timeout to 1 minute
* Making sure all JVMs are running in a 100MB partition

Any help debugging this problem would be appreciated. What kind of diagnostics
can I add? Are there more config parameters that I should try?

-JZ