On 06/09/2010 03:35 PM, Lei Zhang wrote:
We've consistently run into issues with vmware workstation (CentOS as guest
OS) on Windows host: just by leaving the cluster idle over night leads to zk
session expire issue. My theory is: windows may have gone to hibernation,
the zk heartbeat logic hibernates, session expire exception is thrown the
moment windows is taken out of hibernation.
That sounds like a possible scenario.
On EC2 (still CentOS as guest OS), we consistently run into zk session
expire issue when our cluster is under heavy load. I am planning to raise
scheduling priority of zk server, but haven't done testing.
Before you take any action you might examine a few things to identify
what's biting you:
this has some good general detail on issues other users have seen:
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting
In particular you might look at GC/swapping on your clients, that's the
most common case we see for session expiration (apart from the obvious
-- network level connectivity failures). In one case I remember there
was very heavy network load for a period of time once per day, this was
causing some issue on the switches which would result in occassional
session expiration, but only during this short window. This was pretty
hard to track down. Are you monitoring network connectivity in general?
Is it possible that temporary network outages are causing this? Perhaps
take a look at both your server and client ZK logs, see if the client is
seeing anything other than the session expiration (is the client seeing
session TIMED OUT for example, this happens when the client doesn't hear
back from the server, while session expiration happens because the
server doesn't hear from the client).
Good luck,
Patrick