Hey Nathan, I just filed a JIRA to attack this general problem: https://issues.apache.org/jira/browse/HBASE-2940
I think we'll see issues like this more and more as people start to run HBase on larger and larger clusters.

Thanks
-Todd

On Sun, Aug 29, 2010 at 12:37 PM, Nathan Harkenrider <[email protected]> wrote:
> Hello,
>
> I've run into an interesting HBase failover scenario recently and am seeking
> some advice on how to work around the problem.
>
> First of all, I'm running CDH2 (0.20.1+169.89) and HBase 0.20.3 on a 70-node
> cluster. One of the nodes in the cluster appears to have a bad disk or disk
> controller. Hadoop identified the failing node and marked it as dead in the
> HDFS admin page as well as in the jobtracker. The node has not completely
> failed, since I can ping it, but ssh connections are failing. The
> regionserver process on this same node has apparently not completely failed
> either: the HBase master still thinks it is alive, and the node is
> registered in ZooKeeper. Clients hitting regions hosted on this particular
> regionserver are hanging or timing out, which is less than ideal. Any
> thoughts on how to configure HBase to be more sensitive to this type of
> error? Also, is there any way, short of restarting HBase, to force these
> regions to be reassigned to another regionserver if I don't have physical
> access (or a remote console) to stop the regionserver process on the
> failing node?
>
> The master did not report any errors in its log related to the failing
> node. I'm currently waiting on operations to get me the regionserver logs,
> if they can be recovered.
>
> Regards,
>
> Nathan Harkenrider

--
Todd Lipcon
Software Engineer, Cloudera
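For context on the configuration question above: in HBase of this era, the main knob governing how quickly the master notices an unresponsive regionserver is the ZooKeeper session timeout, since a regionserver is considered live as long as its ephemeral znode survives. A minimal hbase-site.xml sketch follows; the `zookeeper.session.timeout` property name is the standard one, but the 30-second value is purely illustrative and would need tuning for a real cluster:

```xml
<!-- hbase-site.xml sketch: shorten the ZooKeeper session timeout so an
     unresponsive regionserver's ephemeral znode expires sooner and the
     master reassigns its regions. The value below is illustrative only;
     setting it too low risks false expirations during long GC pauses. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- milliseconds -->
</property>
```

Note that a half-dead node which still heartbeats to ZooKeeper (as in the scenario described) will not trip this timeout, which is the gap HBASE-2940 is about.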
