RS not processing any requests

Nathaniel Cook Wed, 05 Sep 2012 14:58:43 -0700

We are experiencing a problem where RS are locking up and not processing
any requests. Restarting the RS will fix the problem and operations will
continue as normal. We are experiencing this issue under load and on two
different clusters. We are importing existing data via the hbase mapreduce
import job to a cluster (h1) and then replicating it to a second cluster
(h2). Both the clusters h1 and h2 are experiencing the same problem.


Here are some of the symptoms and logs to go with them. When a RS locks up, we
find a table with one region owned by that RS and attempt to scan the table
with the hbase shell. Because the table's only region is on the locked RS, the
scan hangs until the RS is restarted. Additionally, the HMaster gets a
SocketTimeoutException when communicating with the RS on port 60020.

Log for HMaster:
http://pastebin.com/7yMWWNNR

We ran jstack on both the RS process and the hbase shell process that was
trying to do the scan.

Jstack log for RS:
http://pastebin.com/9Y9t5ERE

Jstack log for scan (hbase shell):
http://pastebin.com/YVTbNDu7
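
A thread dump like the RS one above can be summarized by counting threads per
state, which makes it easier to see whether the IPC handler threads are all
blocked or waiting on the same thing. The following is a minimal sketch in
Python and is not derived from the pastebin logs. It assumes the usual HotSpot
jstack layout (a quoted thread name on one line, followed by a
java.lang.Thread.State line); the function name summarize_jstack and the file
name rs.jstack are hypothetical.

```python
import re
import sys
from collections import Counter, defaultdict

# Thread header line in a HotSpot jstack dump, e.g.
#   "IPC Server handler 3 on 60020" daemon prio=10 tid=0x... nid=0x... runnable [0x...]
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# State line that follows the header of a Java thread, e.g.
#   java.lang.Thread.State: BLOCKED (on object monitor)
THREAD_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>[A-Z_]+)')


def summarize_jstack(dump_text):
    """Return {thread name group: Counter({state: count})} for a jstack dump."""
    summary = defaultdict(Counter)
    current = None
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            # Replace the first number in the name so that pool threads
            # ("... handler 0 ...", "... handler 1 ...") fall into one group.
            current = re.sub(r'\d+', 'N', header.group('name'), count=1)
            continue
        state = THREAD_STATE.match(line)
        if state and current is not None:
            summary[current][state.group('state')] += 1
            current = None  # count each thread once
    return summary


if __name__ == "__main__":
    # Usage (hypothetical file name): python summarize_jstack.py rs.jstack
    with open(sys.argv[1]) as f:
        for group, states in sorted(summarize_jstack(f.read()).items()):
            print(group, dict(states))
```

If most threads in the "IPC Server handler N on 60020" group show up as BLOCKED
or WAITING, the stack frames under those threads are the place to look next.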

We don't see any errors in the region server logs.

We have not been able to figure out what is causing the RS to lock up. We
have checked open file limits, socket limits and basic network connectivity
between the machines. We are well under the limits and can create new
connections between machines unrelated to HBase, so the problem is specific
to talking to region servers and not to network connectivity. Another reason
we believe it is not a network connectivity issue is that the RS are able to
keep their heartbeats with ZooKeeper. We have also checked logs on
NameNodes and DataNodes and are not seeing any issues.
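
A basic connectivity check of the kind described above can be scripted. This is
only a sketch in Python: the helper name tcp_port_open and the host name
rs1.example.com are hypothetical, and 60020 is the RS port mentioned earlier.
Note that a successful connect only shows that the port is listening. The
kernel can complete the TCP handshake even when the process behind the port is
stuck, so a True result does not prove the RS is healthy.

```python
import socket


def tcp_port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Hypothetical host name; replace with a real region server.
    print(tcp_port_open("rs1.example.com", 60020))
```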

We are running HBase version 0.92 (cdh4).

-- 
-Nathaniel Cook
