Hi everyone,

We recently ran into an issue during a cluster upgrade in which most, if not all, of the cluster services (hmaster, regionservers, zookeeper) were down at some point. None of our services could recover from this kind of cluster-wide failure. Restarting our services fixed the issue, but we wanted to see if we could find a way to recover without a restart. We tracked the problem down and found that the HConnection everyone was using had become closed/aborted, and the clients could not retrieve a new, valid HConnection even after all of the cluster services had come back online. In our case we were using an HTablePool, but I would expect something similar to happen even if you were using individual HTables.
If my understanding is correct, HConnectionManager holds a global cache of HConnections which are used by HTable and therefore by HTablePool. It holds one HConnection per unique configuration, keyed by HConnectionKey. An HConnection will only be evicted from this global cache when,

* The number of clients referencing the HConnection is 0
* A client specifically tells HConnectionManager the connection is stale

Unfortunately, neither HTable nor HTablePool seems to have any logic to tell HConnectionManager that the connection is stale, and I don't believe you can rely on all of the clients giving back the connection at the same time in order to solve this issue.

So I have a couple of questions,

1. Since HConnectionImplementation knows whether or not it is being managed, would it make sense for it to remove itself from the HConnectionManager cache via deleteStaleConnection(..) when abort(String, Throwable) is called? Notice that the close() method currently does something similar.
2. Should HConnectionManager delete connections that are closed/aborted and have been passed back to it via the deleteConnection methods?

Although I wish I had a JUnit test that could show this, I also believe that an HConnectionImplementation can become aborted during construction. We saw this happening while the cluster services were down: HConnectionManager would retrieve a new HConnection, but it would come to us already closed/aborted.

There are a couple of other issues with HTablePool[1] in dealing with this, but these behaviors seem like they would need to be addressed first.

[1] - https://issues.apache.org/jira/browse/HBASE-6956

-Bryan
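
P.S. To make the two questions a bit more concrete, here is a rough toy model of the behavior I have in mind. It is only a sketch and not HBase code: Cache, Conn, get, release and deleteStale are hypothetical stand-ins for HConnectionManager, HConnectionImplementation and the delete methods, and the String key stands in for HConnectionKey. It also glosses over most of the real locking.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only: these classes are hypothetical and are NOT the HBase client API. */
public class ConnectionCacheSketch {

    /** Stand-in for a managed connection that knows which cache owns it. */
    static final class Conn {
        private final String key;
        private final Cache owner;
        private int refCount;
        private volatile boolean aborted;

        Conn(String key, Cache owner) {
            this.key = key;
            this.owner = owner;
        }

        boolean isAborted() {
            return aborted;
        }

        /** Question 1: on abort, evict ourselves instead of waiting for refCount to reach 0. */
        void abort(String why, Throwable cause) {
            aborted = true;
            owner.deleteStale(this);
        }
    }

    /** Stand-in for the global cache: one shared connection per configuration key. */
    static final class Cache {
        private final Map<String, Conn> conns = new HashMap<>();

        synchronized Conn get(String key) {
            Conn c = conns.get(key);
            if (c == null) {
                c = new Conn(key, this);
                conns.put(key, c);
            }
            c.refCount++;
            return c;
        }

        /** Question 2: a connection handed back while aborted is dropped regardless of refCount. */
        synchronized void release(Conn c) {
            c.refCount--;
            if (c.refCount <= 0 || c.isAborted()) {
                conns.remove(c.key, c); // no-op if a newer instance is already cached
            }
        }

        synchronized void deleteStale(Conn c) {
            conns.remove(c.key, c); // only removes c itself, never a replacement
        }
    }

    public static void main(String[] args) {
        Cache cache = new Cache();
        Conn a = cache.get("cluster-1");
        Conn b = cache.get("cluster-1");         // same shared instance as a
        a.abort("zookeeper session lost", null); // evicts itself from the cache
        Conn c = cache.get("cluster-1");         // new instance, although a/b were never released
        System.out.println(a == b);                    // prints true
        System.out.println(c != a && !c.isAborted());  // prints true
    }
}
```

With something like this, clients still holding the aborted instance would keep failing until they ask the cache again, but at least the next lookup returns a usable connection instead of the dead one. It would not help with the case where a connection comes back already aborted from construction.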
