Re: Debugging help for SessionExpiredException
On 06/09/2010 03:35 PM, Lei Zhang wrote:
> We've consistently run into issues with VMware Workstation (CentOS as guest OS) on a Windows host: just leaving the cluster idle overnight leads to ZK session expiration. My theory is: Windows may have gone into hibernation, the ZK heartbeat logic hibernates with it, and the session expired exception is thrown the moment Windows is taken out of hibernation.

That sounds like a possible scenario.

> On EC2 (still CentOS as guest OS), we consistently run into ZK session expiration when our cluster is under heavy load. I am planning to raise the scheduling priority of the ZK server, but haven't done any testing yet.

Before you take any action you might examine a few things to identify what's biting you. This has some good general detail on issues other users have seen:
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

In particular you might look at GC/swapping on your clients; that's the most common cause we see for session expiration (apart from the obvious -- network level connectivity failures).

In one case I remember there was very heavy network load for a period of time once per day. This was causing some issue on the switches, which would result in occasional session expiration, but only during this short window. This was pretty hard to track down. Are you monitoring network connectivity in general? Is it possible that temporary network outages are causing this?

Perhaps take a look at both your server and client ZK logs and see if the client is seeing anything other than the session expiration. Is the client seeing session TIMED OUT, for example? That happens when the client doesn't hear back from the server, while session expiration happens because the server doesn't hear from the client.

Good luck,
Patrick
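For the "are you monitoring network connectivity" question, something as small as the following could be run from cron on each client host and its output compared against the times of the expirations. This is only a sketch: the class name ConnectivityProbe and the 2-second timeout are made up for illustration, and the localhost:2181 default assumes the server listens on the usual client port. A successful TCP connect shows only that the port was reachable at that moment, not that the ZK server was healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ConnectivityProbe {
    /** Returns the TCP connect time in ms, or -1 if the connect failed or timed out. */
    static long probe(String host, int port, int timeoutMs) {
        long start = System.nanoTime();
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), timeoutMs);
            return (System.nanoTime() - start) / 1_000_000;
        } catch (IOException e) {
            return -1; // refused, unreachable, unresolved or timed out
        }
    }

    public static void main(String[] args) {
        // Defaults are assumptions; pass the real server host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        long ms = probe(host, port, 2000);
        // One timestamped line per run, easy to grep next to the ZK logs.
        System.out.println(System.currentTimeMillis() + " " + host + ":" + port + " "
                + (ms < 0 ? "UNREACHABLE" : ms + " ms"));
    }
}
```

Gaps or UNREACHABLE lines that line up with the TIMED OUT or expiration messages would point at the network rather than at the client JVM.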
Re: Debugging help for SessionExpiredException
This can depend on which kind of instance you invoke as well. The smallest instances disappear for short periods of time and that can lead to surprises.

On Wed, Jun 9, 2010 at 3:35 PM, Lei Zhang wrote:
> On EC2 (still CentOS as guest OS), we consistently run into zk session
> expire issue when our cluster is under heavy load. I am planning to raise
> scheduling priority of zk server, but haven't done testing.
Re: Debugging help for SessionExpiredException
We use ZooKeeper in virtualized environments, both on Amazon EC2 and on VMware Workstation on local machines.

We've consistently run into issues with VMware Workstation (CentOS as guest OS) on a Windows host: just leaving the cluster idle overnight leads to ZK session expiration. My theory is: Windows may have gone into hibernation, the ZK heartbeat logic hibernates with it, and the session expired exception is thrown the moment Windows is taken out of hibernation.

On EC2 (still CentOS as guest OS), we consistently run into ZK session expiration when our cluster is under heavy load. I am planning to raise the scheduling priority of the ZK server, but haven't done any testing yet.
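One way to test the hibernation theory (and the "instance disappears for a while" case on EC2) is to run a small watchdog in the guest that sleeps for a fixed interval and logs whenever the sleep takes much longer than requested. The sketch below is only an illustration: PauseDetector, the 1-second interval and the 5-second threshold are invented names and values, and how the guest's wall clock behaves across a host suspend depends on the hypervisor and its time sync settings, so a quiet log does not fully rule the theory out.

```java
import java.util.ArrayList;
import java.util.List;

public class PauseDetector {
    /** How far a sleep overshot its requested duration; never negative. */
    static long overshootMillis(long requestedMs, long elapsedMs) {
        return Math.max(0L, elapsedMs - requestedMs);
    }

    /** Sleeps `rounds` times for `intervalMs`; returns every overshoot above `thresholdMs`. */
    static List<Long> watch(long intervalMs, long thresholdMs, int rounds) {
        List<Long> stalls = new ArrayList<>();
        for (int i = 0; i < rounds; i++) {
            long before = System.currentTimeMillis(); // wall clock, on purpose
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            long over = overshootMillis(intervalMs, System.currentTimeMillis() - before);
            if (over > thresholdMs) {
                stalls.add(over);
                System.err.println(System.currentTimeMillis() + " stalled for about " + over + " ms");
            }
        }
        return stalls;
    }

    public static void main(String[] args) {
        // Short demo run; for an overnight test raise `rounds` accordingly.
        int rounds = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        List<Long> stalls = watch(1000, 5000, rounds);
        System.out.println("stalls seen: " + stalls.size());
    }
}
```

If a stall longer than the session timeout shows up right before the expiration, the heartbeat could not have been sent in time no matter what priority the server process has.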
Re: Debugging help for SessionExpiredException
On Wed, Jun 9, 2010 at 2:47 PM, Patrick Hunt wrote:
> My guess is that your client is gcing for long periods of time - you can
> rule this in/out by turning on gc logging in your clients and then viewing
> the results after another such incident happens (try gchisto for graphical
> view)

From recent experience, the incantation:

    -verbose:gc -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

at JVM startup time will give you very good information about what the garbage collector is doing.

Steve
--
Stephen Green
http://thesearchguy.wordpress.com
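If restarting the clients with extra flags is awkward, a rough in-process alternative is to poll the JVM's own GC counters through the standard java.lang.management API and log the delta. This is a sketch, not a replacement for real GC logs: the class name GcWatch and the 1-second sampling interval are arbitrary, and getCollectionTime() is an approximate cumulative figure that for some collectors also covers work done concurrently with the application.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatch {
    /** Approximate total milliseconds spent in GC so far, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        // Short demo run; in a real client this loop would live in a daemon thread.
        int samples = args.length > 0 ? Integer.parseInt(args[0]) : 3;
        long last = totalGcMillis();
        for (int i = 0; i < samples; i++) {
            Thread.sleep(1000);
            long now = totalGcMillis();
            System.out.println(System.currentTimeMillis() + " gc ms since last sample: " + (now - last));
            last = now;
        }
    }
}
```

A sample where the GC time is a large fraction of the session timeout is the kind of thing worth lining up against the expiration timestamps.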
Re: Debugging help for SessionExpiredException
"100mb partition"? sounds like virtualization. resource starvation (worse in virtualized env) is a common cause of this. Are your clients gcing/swapping at all? If a client gc's for long periods of time the heartbeat thread won't be able to run and the server will expire the session. There is a min/max cap that the server places on the client timeouts (it's negotiated), check the client log for detail on what timeout it negotiated (logged in 3.3 releases) take a look at this and see if you can make progress: http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting My guess is that your client is gcing for long periods of time - you can rule this in/out by turning on gc logging in your clients and then viewing the results after another such incident happens (try gchisto for graphical view) Patrick On 06/09/2010 11:36 AM, Jordan Zimmerman wrote: We have a test system using Zookeeper. There is a single Zookeeper server node and 4 clients. There is very little activity in this system. After a day's testing we start to see SessionExpiredException on the client. Things I've tried: * Increasing the session timeout to 1 minute * Making sure all JVMs are running in a 100MB partition Any help debugging this problem would be appreciated. What kind of diagnostics should can I add? Are there more config parameters that I should try? -JZ
Debugging help for SessionExpiredException
We have a test system using ZooKeeper. There is a single ZooKeeper server node and 4 clients. There is very little activity in this system. After a day's testing we start to see SessionExpiredException on the client.

Things I've tried:
* Increasing the session timeout to 1 minute
* Making sure all JVMs are running in a 100MB partition

Any help debugging this problem would be appreciated. What kind of diagnostics can I add? Are there more config parameters that I should try?

-JZ
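On the "session timeout to 1 minute" attempt: as noted in the replies, the server caps what the client asks for. The sketch below assumes the 3.3 defaults, where the bounds are 2 x tickTime and 20 x tickTime unless minSessionTimeout/maxSessionTimeout are set, and a tickTime of 2000 ms as in the sample zoo.cfg. The class and method names are made up for illustration; the value the server actually granted is the one logged by the client.

```java
public class SessionTimeoutBounds {
    /** Clamps a requested session timeout to [2 * tickTime, 20 * tickTime]. */
    static int negotiate(int requestedMs, int tickTimeMs) {
        int min = 2 * tickTimeMs;   // assumed default lower bound
        int max = 20 * tickTimeMs;  // assumed default upper bound
        return Math.max(min, Math.min(max, requestedMs));
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000; // assumption: value from the sample zoo.cfg
        System.out.println(negotiate(60_000, tickTimeMs)); // prints 40000
        System.out.println(negotiate(30_000, tickTimeMs)); // prints 30000
        System.out.println(negotiate(1_000, tickTimeMs));  // prints 4000
    }
}
```

Under those assumptions a request for 60 seconds would be granted as 40 seconds, so the effective timeout may be shorter than the one configured on the client.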