We are experiencing a problem where region servers (RS) lock up and stop processing requests. Restarting the RS fixes the problem, and operations continue as normal. We see this issue under load and on two different clusters. We are importing existing data via the HBase MapReduce import job into one cluster (h1) and then replicating it to a second cluster (h2). Both h1 and h2 are experiencing the same problem.
Here are some of the symptoms, with logs to go with them.

When an RS locks up, we find a table with one region owned by that RS and attempt to scan it with the hbase shell. Because the table's only region is on the locked RS, the scan hangs until the RS is restarted. Additionally, the HMaster gets a SocketTimeoutException when communicating with the RS on port 60020.

Log for HMaster: http://pastebin.com/7yMWWNNR

We ran jstack on both the RS process and the hbase shell process that was trying to do the scan.

Jstack log for RS: http://pastebin.com/9Y9t5ERE
Jstack log for scan (hbase shell): http://pastebin.com/YVTbNDu7

We don't see any errors in the region server logs, and we have not been able to figure out what is causing the RS to lock up.

We have checked open file limits, socket limits, and basic network connectivity between the machines. We are well under the limits and can create new connections between machines unrelated to HBase, so the problem is specific to talking to region servers and not to network connectivity. Another reason we believe it's not a network connectivity issue is that the RS are able to keep up their heartbeats with ZooKeeper. We have also checked the logs on the NameNodes and DataNodes and are not seeing any issues.

We are running HBase version 0.92 (cdh4).

--
-Nathaniel Cook
