Hello, I've run into an interesting HBase failover scenario recently and am seeking some advice on how to work around the problem.
First of all, I'm running CDH2 (0.20.1+169.89) and HBase 0.20.3 on a 70-node cluster. One of the nodes in the cluster appears to have a bad disk or disk controller. Hadoop identified the failing node and marked it as dead in both the HDFS admin page and the jobtracker. The node has not completely failed, since I can still ping it, but ssh connections are failing.

The regionserver process on this same node has apparently not completely failed either: the HBase master still thinks it is alive, and the node is still registered in ZooKeeper. Clients hitting regions hosted on this particular regionserver are hanging or timing out, which is less than ideal.

Any thoughts on how to configure HBase to be more sensitive to this type of error? Also, is there any way, short of restarting HBase, to force these regions to be reassigned to another regionserver, given that I don't have physical access (or remote console) to stop the regionserver process on the failing node?

The master did not report any errors in its log related to the failing node. I'm currently waiting on operations to get me the regionserver logs, if they can be recovered.

Regards,
Nathan Harkenrider
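For reference, the kind of change I've been considering on the sensitivity question is lowering the ZooKeeper session timeout in hbase-site.xml. This is just a sketch under my assumptions: I believe `zookeeper.session.timeout` is the property that governs how long a regionserver's ZooKeeper session can sit silent before it expires, and the 30000 ms value below is purely illustrative, not a recommendation I've tested.

```xml
<!-- hbase-site.xml (sketch): shorten the ZooKeeper session timeout so a
     wedged regionserver's ephemeral znode expires sooner. Assumes
     zookeeper.session.timeout is the relevant knob in 0.20.3; the value
     here is illustrative only. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value>
</property>
```

If I understand the mechanism correctly, once the session expires the regionserver's ephemeral znode goes away and the master should notice and reassign its regions. Presumably the same effect could be forced by hand by deleting that znode with the ZooKeeper CLI, but I'd want confirmation from someone who has done it before trying that against a live cluster.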
