This is good to know. It will allow us to try and replicate the situation, which we haven't been able to do.

I'm hoping we can come up with something that we can proactively do to address this...


Nitay wrote:
Also, I should mention that some of the errors Andrew was seeing are related

I see this kind of stuff:

2009-04-07 17:58:13,344 - WARN  [NIOServerCxn.Factory:2181:
nioserverc...@417] - Exception
causing close of session 0x2208296c38e0000 due to Read error

and bye bye HRS ephemeral znodes, which triggers
(currently) HBASE-1314.

This I think is ZOOKEEPER-344

  - Andy

On Wed, Apr 8, 2009 at 12:39 PM, Nitay wrote:

Hey guys,

We've recently replaced a few pieces of HBase's cluster management and
coordination with ZooKeeper. One of our guys, Andrew Purtell, has a cluster that
he throws a lot of load at. Andrew's cluster was getting a lot of
SessionExpired events which were causing some havoc. After some discussion
on the hbase list and additional testing by Andrew (tweaking things like the
session timeout, quorum size, and GC used), we suspect the problem is that
the Java GC is starving the ZooKeeper heartbeat thread from executing.

There is a JIRA open on the matter where Joey suggests a solution that has
worked for him:

We wanted to loop you guys in to see if you have any thoughts/suggestions
on the matter.
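
One rough way to check the GC-starvation theory above is to run a small thread that sleeps in short ticks and records how much longer each sleep took than requested. A long overshoot means the JVM was stalled (by a stop-the-world GC or something else), and a stall approaching the session timeout would be enough to lose the session. This is only a sketch: PauseMonitor and its methods are hypothetical names, not part of ZooKeeper or HBase, and the 10 second session timeout is an assumed value.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper for spotting JVM stalls; not part of ZooKeeper or HBase. */
final class PauseMonitor {

    /** How much longer than requested a sleep took, in ms; 0 if it was on time. */
    public static long excessPauseMillis(long requestedMs, long elapsedMs) {
        return Math.max(0L, elapsedMs - requestedMs);
    }

    /** True if a stall of this length would by itself outlast the session timeout. */
    public static boolean exceedsSessionTimeout(long pauseMs, long sessionTimeoutMs) {
        return pauseMs >= sessionTimeoutMs;
    }

    /** Sleeps in short ticks and returns the worst overshoot seen, in ms. */
    public static long worstStallMillis(long tickMs, int ticks) throws InterruptedException {
        long worst = 0L;
        for (int i = 0; i < ticks; i++) {
            long start = System.nanoTime();
            Thread.sleep(tickMs);
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            worst = Math.max(worst, excessPauseMillis(tickMs, elapsedMs));
        }
        return worst;
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed value; substitute whatever session timeout the cluster uses.
        long sessionTimeoutMs = 10_000L;
        long worst = worstStallMillis(100L, 50);
        System.out.println("worst stall: " + worst + " ms, long enough to expire the session: "
                + exceedsSessionTimeout(worst, sessionTimeoutMs));
    }
}
```

Running this inside the same JVM and under the same load as the affected process, and comparing the stalls against the GC log, should show whether the pauses line up with the session expirations.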


