Hey guys,

We've recently replaced a few pieces of HBase's cluster management and coordination with ZooKeeper. One of our guys, Andrew Purtell, has a cluster that he throws a lot of load at. Andrew's cluster was getting a lot of SessionExpired events, which were causing some havoc. After some discussion on the HBase list and additional testing by Andrew (tweaking things like the session timeout, the quorum size, and which GC is used), we suspect the problem is that the Java GC is starving the ZooKeeper heartbeat thread, so heartbeats don't go out before the session times out.
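For anyone who wants to check the GC suspicion on their own cluster, here is a rough sketch of a stall detector. It is not from HBase or ZooKeeper: the class name `PauseDetector`, its methods, and the 10 second session timeout are all made up for illustration. The idea is to sleep for a short, fixed interval and measure how late the thread wakes up. A large overshoot means the thread was not scheduled for that long, which is what a heartbeat thread would see during a long GC pause (OS scheduling delays would show up the same way, so treat it as a hint, not proof).

```java
import java.util.concurrent.TimeUnit;

public class PauseDetector {

    // Hypothetical helper: how many ms longer than requested a sleep took.
    // Never negative; an early wake-up counts as zero stall.
    static long stallMillis(long requestedMillis, long elapsedNanos) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(elapsedNanos);
        return Math.max(0L, elapsedMillis - requestedMillis);
    }

    // Sleep repeatedly and return the worst overshoot observed, in ms.
    static long worstStallMillis(int iterations, long sleepMillis)
            throws InterruptedException {
        long worst = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMillis);
            long elapsed = System.nanoTime() - start;
            worst = Math.max(worst, stallMillis(sleepMillis, elapsed));
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed value for illustration; use whatever your cluster is configured with.
        long sessionTimeoutMillis = 10_000L;

        long worst = worstStallMillis(50, 20L);
        System.out.println("worst stall: " + worst + " ms");
        if (worst >= sessionTimeoutMillis) {
            System.out.println("stall is at least as long as the session timeout");
        }
    }
}
```

Run inside the same JVM as the loaded process (for example on a background thread), stalls that approach the session timeout would line up with the SessionExpired theory.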
There is a JIRA open on the matter where Joey suggests a solution that has worked for him: https://issues.apache.org/jira/browse/HBASE-1316

We wanted to loop you guys in to see if you have any thoughts/suggestions on the matter.

Thanks,
-n