I'm running HBase 1.4.4 on EMR. In following your suggestions I realized
that the master is trying to assign the regions to dead/non-existent
region servers. While trying to fix this problem I killed the EMR
cluster and started a new one. The master is still trying to assign some
regions to region servers from the previous cluster. I tried to manually
move one of the regions to a good region server but I'm getting 'ERROR:
No route to host' when I try to close the region.
I've tried nuking the /hbase directory in ZooKeeper but that didn't seem
to help so I'm not sure where it's getting these references from.
-Austin
On 09/30/2018 02:38 PM, Josh Elser wrote:
First off: You're on EMR? What version of HBase are you using? (Maybe
Zach or Stephen can help here too). Can you figure out the
RegionServer(s) which are stuck opening these PENDING_OPEN regions?
Can you get a jstack/thread-dump from those RS's?
In terms of how the system is supposed to work: the PENDING_OPEN state
for a Region "R" means: the active Master has asked a RegionServer to
open R. That RS should have an active thread which is trying to open
R. Upon success, the state of R will move from PENDING_OPEN to OPEN.
Otherwise, the Master will try to assign R again.
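As a rough illustration of the flow just described, here is a toy model in Python. It is not HBase code: the `assign` function, the `try_open` callback, the round-robin choice of server, and the retry limit are all assumptions made for the sketch. If every attempt fails, this model simply leaves the region in PENDING_OPEN.

```python
from enum import Enum


class RegionState(Enum):
    PENDING_OPEN = "PENDING_OPEN"
    OPEN = "OPEN"


def assign(region, servers, try_open, max_attempts=3):
    """Toy model of the Master's assignment loop (not real HBase code).

    try_open(server, region) is a hypothetical stand-in for the
    RegionServer's attempt to open the region; it returns True on success.
    Returns a (state, server) tuple; server is None if nothing opened it.
    """
    for attempt in range(max_attempts):
        # Assumption: pick servers round-robin; the real Master uses its
        # own balancer logic to choose a target.
        server = servers[attempt % len(servers)]
        # At this point the region is PENDING_OPEN on `server`.
        if try_open(server, region):
            # Success: PENDING_OPEN -> OPEN.
            return RegionState.OPEN, server
        # Failure: the Master tries to assign the region again.
    return RegionState.PENDING_OPEN, None
```

For example, `assign("R", ["rs1", "rs2"], lambda s, r: s == "rs2")` ends with the region OPEN on `rs2`, while a server list containing only unreachable hosts leaves it in PENDING_OPEN.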
In the absence of any custom coprocessors (including Phoenix), this would
mean some subset of RegionServers are in a bad state. Figuring out
what those RS's are trying to do will be the next step in figuring out
why they're stuck like that. It might be obvious from the UI, or you
might have to look at hbase:meta or the master log to figure it out.
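One way to narrow this down is to compare the server recorded for each region against the RegionServers that are actually up. The sketch below assumes you have already collected, by hand or with your own tooling, a mapping of region name to "host:port" (for instance from a scan of hbase:meta) and a list of live servers (for instance from the Master UI). The function name, the input shapes, and the host names are hypothetical.

```python
def regions_on_unknown_servers(assignments, live_servers):
    """Return region names whose recorded server is not a live RegionServer.

    assignments:  dict of region name -> "host:port" string (hypothetical
                  input, e.g. extracted from a scan of hbase:meta).
    live_servers: iterable of "host:port" strings for RegionServers that
                  are currently up (e.g. copied from the Master UI).
    """
    live = set(live_servers)
    return sorted(
        region
        for region, server in assignments.items()
        if server not in live
    )


# Made-up example data: one region still points at a host that is gone.
assignments = {
    "region-a": "new-host-1:16020",
    "region-b": "old-host-7:16020",
}
stale = regions_on_unknown_servers(assignments, ["new-host-1:16020"])
```

Any region that shows up in `stale` is one the cluster still associates with a server that no longer exists, which is a good place to start reading the Master log.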
One caveat: it's possible that the Master is just not doing the right
thing as described above. If the steps described above don't seem to
be matching what your system is doing, you might have to look closer
at the Master log. Make sure you have DEBUG on to get anything of
value out of the system.
On 9/30/18 1:43 PM, Austin Heyne wrote:
I'm having a strange problem that my usual bag of tricks is having
trouble sorting out. On Friday queries stopped returning for some
reason. You could see them come in and there would be a resource
utilization spike that would fade out after an appropriate amount of
time, however, the query would never actually return. This could be
related to our client code but I wasn't able to dig into it since
this was the middle of the day on a production system. Since this had
happened before and bouncing HBase cleared it up, I proceeded to
disable tables and restart HBase. Upon bringing HBase back up, a few
thousand regions are stuck in PENDING_OPEN state and refuse to move
from that state. I've run hbck -repair a number of times under a few
conditions (even the offline repair), have deleted everything out of
/hbase in ZooKeeper and even migrated the cluster to new servers
(EMR) with no luck. When I spin HBase up the regions are already at
PENDING_OPEN even though the tables are offline.
Any ideas on what's going on here would be a huge help.
Thanks,
Austin
--
Austin L. Heyne