Hi everyone,

We recently ran into an issue during an upgrade of our cluster which took
down most if not all of the cluster services at some point (hmaster,
regionservers, zookeeper). Essentially none of our services could recover
from this kind of cluster-wide failure. Restarting our services fixed the
issue, but we wanted to see if we could find a way to recover from this. We
managed to track this down and found that the HConnection everyone was
using had become closed/aborted, and clients were unable to retrieve a new
valid HConnection even after all of the services had come back online. In
our case we were using an HTablePool, but I would expect something similar
to happen even if you were using individual HTables.

If my understanding is correct it seems HConnectionManager holds a global
cache of HConnections which are used by HTable and therefore HTablePool. It
holds one HConnection per unique configuration via HConnectionKey. An
HConnection will only be evicted from this global cache when either:

* The number of clients referencing the HConnection drops to 0
* A client specifically tells HConnectionManager the connection is stale

Unfortunately it does not seem like HTable or HTablePool have any logic to
tell the HConnectionManager the connection is stale and I don't believe you
can rely on all of the clients giving back the connection at the same time
in order to solve this issue.
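
To make the failure mode concrete, here is a toy model of a
reference-counted cache that behaves the way I described above. This is not
HBase code: Conn, Cache, acquire, release and deleteStale are names I made
up for illustration, and the real HConnectionManager may well differ in the
details.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleConnectionCacheSketch {

    /** Hypothetical stand-in for a cached connection (not the HBase class). */
    static final class Conn {
        final String key;
        int refCount = 0;
        boolean aborted = false;

        Conn(String key) {
            this.key = key;
        }
    }

    /** Hypothetical stand-in for a global cache keyed by configuration. */
    static final class Cache {
        private final Map<String, Conn> byKey = new HashMap<>();

        /** Returns the shared instance for this key, creating it if absent. */
        Conn acquire(String key) {
            Conn c = byKey.computeIfAbsent(key, Conn::new);
            c.refCount++;
            return c;
        }

        /** Evicts only when the last reference is given back. */
        void release(Conn c) {
            if (--c.refCount == 0) {
                byKey.remove(c.key, c);
            }
        }

        /** Explicit eviction; somebody has to remember to call this. */
        void deleteStale(Conn c) {
            byKey.remove(c.key, c);
        }
    }

    public static void main(String[] args) {
        Cache cache = new Cache();
        Conn a = cache.acquire("cluster-1");
        Conn b = cache.acquire("cluster-1"); // same instance as a
        a.aborted = true;                    // simulate the cluster outage

        // One client gives its reference back, but the other still holds one,
        // so the aborted instance stays in the cache.
        cache.release(a);
        Conn c = cache.acquire("cluster-1");
        System.out.println("still handed the aborted instance: "
                + (c == b && c.aborted));

        // Only an explicit stale notification produces a fresh instance.
        cache.deleteStale(c);
        Conn d = cache.acquire("cluster-1");
        System.out.println("fresh instance after deleteStale: "
                + (d != c && !d.aborted));
    }
}
```

In this model, a client that never calls deleteStale keeps receiving the
aborted instance for as long as any other client holds a reference to it.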

So I have a couple of questions:

1. Since HConnectionImplementation knows whether it is being managed or
not, would it make sense for it to remove itself from the
HConnectionManager cache, via deleteStaleConnection(..), when
abort(String, Throwable) is called? Notice that the close() method
currently does something similar.

2. Should HConnectionManager delete connections that are closed/aborted and
have been passed back to it via the deleteConnection methods?
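
As a rough sketch of what I mean by both questions, here is a toy model
(again, not HBase code; every class and method name below is hypothetical).
The connection evicts itself from its owning cache when it is aborted, and
the cache also drops an aborted connection whenever one is handed back,
regardless of how many references remain.

```java
import java.util.HashMap;
import java.util.Map;

public class SelfEvictingConnectionSketch {

    /** Hypothetical stand-in for the global connection cache. */
    static final class Cache {
        private final Map<String, Conn> byKey = new HashMap<>();

        synchronized Conn acquire(String key) {
            Conn c = byKey.get(key);
            if (c == null) {
                c = new Conn(key, this); // managed: it knows its owner
                byKey.put(key, c);
            }
            c.refCount++;
            return c;
        }

        /** Question 2: drop on zero references OR when the connection is dead. */
        synchronized void release(Conn c) {
            c.refCount--;
            if (c.refCount <= 0 || c.aborted) {
                // remove(key, value) never evicts a newer replacement instance
                byKey.remove(c.key, c);
            }
        }

        synchronized void deleteStale(Conn c) {
            byKey.remove(c.key, c);
        }
    }

    /** Hypothetical stand-in for a connection; owner == null means unmanaged. */
    static final class Conn {
        final String key;
        final Cache owner;
        int refCount;
        volatile boolean aborted;

        Conn(String key, Cache owner) {
            this.key = key;
            this.owner = owner;
        }

        /** Question 1: a managed connection removes itself when aborted. */
        void abort(String why, Throwable cause) {
            aborted = true;
            if (owner != null) {
                owner.deleteStale(this);
            }
        }
    }

    public static void main(String[] args) {
        Cache cache = new Cache();
        Conn a = cache.acquire("cluster-1");
        Conn b = cache.acquire("cluster-1"); // same instance as a
        a.abort("simulated outage", new RuntimeException("zk down"));

        // The next caller gets a new instance without anyone calling deleteStale.
        Conn fresh = cache.acquire("cluster-1");
        System.out.println("fresh instance after abort: "
                + (fresh != a && !fresh.aborted));

        // Late releases of the old instance must not evict the replacement.
        cache.release(a);
        cache.release(b);
        System.out.println("replacement still cached: "
                + (cache.acquire("cluster-1") == fresh));
    }
}
```

The remove(key, value) form matters here: clients still holding the old
instance will release it later, and that must not throw away the
replacement that other clients are already using.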

Although I wish I had a JUnit test that could show this, I also believe
that an HConnectionImplementation can become aborted during construction.
We saw this happening while the cluster services were down:
HConnectionManager would retrieve a new HConnection, but it would come to
us already closed/aborted.

There are a couple of other issues with HTablePool[1] in dealing with this
problem, but these behaviors seem like they would need to be addressed first.

[1] - https://issues.apache.org/jira/browse/HBASE-6956

-Bryan
