Hey Nathan,

I just filed a JIRA to attack this general problem:
https://issues.apache.org/jira/browse/HBASE-2940

I think we'll see issues like this more and more as people start to run
HBase on larger and larger clusters.
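In the meantime, one knob worth looking at is the ZooKeeper session timeout: the master only declares a regionserver dead once its ZK session expires, so lowering it makes a wedged-but-still-pingable node get noticed sooner. Something like the following in hbase-site.xml (the property name is the standard one; the value here is just an example, and setting it too low trades faster detection for spurious expirations during long GC pauses):

```xml
<!-- hbase-site.xml on the regionservers: illustrative value only -->
<property>
  <name>zookeeper.session.timeout</name>
  <!-- ms of silence before the master treats a regionserver as dead;
       too low and GC pauses can cause false positives -->
  <value>30000</value>
</property>
```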

Thanks
-Todd

On Sun, Aug 29, 2010 at 12:37 PM, Nathan Harkenrider <
[email protected]> wrote:

> Hello,
>
> I've run into an interesting HBase failover scenario recently and am
> seeking some advice on how to work around the problem.
>
> First of all, I'm running CDH2 (0.20.1+169.89) and HBase 0.20.3 on a
> 70-node cluster. One of the nodes in the cluster appears to have a bad
> disk or disk
> controller. Hadoop identified the failing node and marked it as dead in the
> HDFS admin page as well as the jobtracker. The node has not completely
> failed since I can ping it, but ssh connections are failing. The
> regionserver process on this same node has apparently not completely failed
> either. The HBase master still thinks it is alive, and the node is
> registered in Zookeeper. Clients hitting regions hosted on this particular
> region server are hanging/timing out, which is less than ideal. Any
> thoughts on how to configure HBase to be more sensitive to this type of
> error? Also, is there any way, short of restarting HBase, that I can force
> these regions to be reassigned to another regionserver if I don't have
> physical access (or remote console) to stop the regionserver process on the
> failing node?
>
> The master did not report any errors in its log related to the failing
> node.
> I'm currently waiting on operations to get me the regionserver logs if they
> can be recovered.
>
> Regards,
>
> Nathan Harkenrider
>
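On forcing reassignment without access to the box: the shell can ask the master to close a region, which should trigger reassignment to another regionserver. A rough sketch, run from any node with HBase client config (the region name below is a placeholder, not from your cluster, and the exact close_region arguments vary by shell version, so check `help` first):

```shell
# No ssh to the failing host needed. List real region names first
# (e.g. by scanning .META.), then close the affected ones so the
# master reassigns them. Placeholder region name shown.
echo "close_region 'mytable,,1282000000000'" | hbase shell
```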



-- 
Todd Lipcon
Software Engineer, Cloudera
