Hello,

I've run into an interesting HBase failover scenario recently and am seeking
some advice on how to work around the problem.

First of all, I'm running CDH2 (0.20.1+169.89) and HBase 0.20.3 on a 70-node
cluster. One of the nodes in the cluster appears to have a bad disk or disk
controller. Hadoop identified the failing node and marked it as dead in the
HDFS admin page as well as the jobtracker. The node has not completely
failed since I can ping it, but ssh connections are failing. The
regionserver process on this same node has apparently not completely failed
either. The HBase master still thinks it is alive, and the node is
registered in Zookeeper. Clients hitting regions hosted on this particular
region server are hanging/timing out, which is less than ideal. Any thoughts
on how to configure HBase to be more sensitive to this type of error? Also,
is there any way, short of restarting HBase, that I can force these regions
to be reassigned to another regionserver? I don't have physical access (or
remote console) to stop the regionserver process on the failing node.
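
In case it clarifies what I'm considering: my current best guess is to lower
the ZooKeeper session timeout in hbase-site.xml, so the master expires a
wedged regionserver's session sooner and reassigns its regions. I'm assuming
zookeeper.session.timeout is the right property name for this 0.20.x era and
that the default is on the order of a minute; please correct me if I'm wrong:

```xml
<!-- hbase-site.xml: sketch only. Shorten the ZooKeeper session timeout so a
     regionserver that stops heartbeating (but whose process is still alive)
     is declared dead sooner. Property name and default are my assumptions
     for HBase 0.20.x; value is in milliseconds. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value>
</property>
```

My understanding is that setting this too low risks expiring healthy
regionservers during long GC pauses, so I'd want to tune it against our
observed pause times rather than just minimize it.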

The master did not report any errors in its log related to the failing node.
I'm currently waiting on operations to get me the regionserver logs if they
can be recovered.

Regards,

Nathan Harkenrider
