On 06/09/2010 03:35 PM, Lei Zhang wrote:

We've consistently run into issues with VMware Workstation (CentOS as guest
OS) on a Windows host: just leaving the cluster idle overnight leads to a zk
session expiration. My theory is: Windows may have gone into hibernation,
the zk heartbeat logic hibernates with it, and the session expired exception
is thrown the moment Windows is taken out of hibernation.


That sounds like a possible scenario.

On EC2 (still CentOS as guest OS), we consistently run into zk session
expiration when our cluster is under heavy load. I am planning to raise the
scheduling priority of the zk server, but haven't tested this yet.


Before you take any action you might examine a few things to identify what's biting you:

This page has some good general detail on issues other users have seen:
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

In particular you might look at GC/swapping on your clients; that's the most common cause we see for session expiration (apart from the obvious: network-level connectivity failures). In one case I remember, there was very heavy network load for a period of time once per day. This was causing some issue on the switches which would result in occasional session expiration, but only during this short window. This was pretty hard to track down.

Are you monitoring network connectivity in general? Is it possible that temporary network outages are causing this? Perhaps take a look at both your server and client ZK logs and see if the client is seeing anything other than the session expiration. Is the client seeing session TIMED OUT, for example? This happens when the client doesn't hear back from the server, while session expiration happens because the server doesn't hear from the client.
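One rough way to check for GC/swapping stalls (or a guest VM being frozen) is to run a small pause detector in the client JVM. The sketch below is not part of ZooKeeper: the class name PauseDetector, the 30 second session timeout, the 100 ms sampling interval and the one-third warning threshold are all assumptions chosen for illustration. It sleeps for a short, fixed interval and reports how much longer than expected each sleep took; a large overshoot means the whole process was stalled and could not have sent heartbeats during that time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the ZooKeeper API.
public class PauseDetector {

    /** How far (in ms) an observed gap exceeded the expected sleep; never negative. */
    public static long overshootMillis(long expectedMs, long observedMs) {
        return Math.max(0L, observedMs - expectedMs);
    }

    /**
     * True when a stall is long enough to be worth investigating for a session
     * with the given timeout. The one-third threshold is an arbitrary,
     * illustrative choice; tune it for your setup.
     */
    public static boolean threatensSession(long overshootMs, long sessionTimeoutMs) {
        return overshootMs >= sessionTimeoutMs / 3;
    }

    /** Sleeps repeatedly and records the overshoot of each sleep. */
    public static List<Long> sample(long intervalMs, int iterations)
            throws InterruptedException {
        List<Long> overshoots = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            // nanoTime is monotonic, but on some platforms it may not advance
            // while the machine is suspended; System.currentTimeMillis() could
            // be used instead to catch that case, at the cost of being
            // sensitive to clock adjustments.
            long start = System.nanoTime();
            Thread.sleep(intervalMs);
            long observedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            overshoots.add(overshootMillis(intervalMs, observedMs));
        }
        return overshoots;
    }

    public static void main(String[] args) throws InterruptedException {
        long sessionTimeoutMs = 30_000; // assumed; use the timeout your client requests
        long intervalMs = 100;          // assumed sampling interval
        for (long overshoot : sample(intervalMs, 50)) {
            if (threatensSession(overshoot, sessionTimeoutMs)) {
                System.out.println("Process stalled for about " + overshoot + " ms");
            }
        }
    }
}
```

In practice you would run the sampling loop in a background thread for the life of the client and compare the timestamps of reported stalls with the session expirations in the ZK logs. If the two line up, the client itself is the likely culprit rather than the network.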

Good luck,

Patrick
