How does one recover when a regionserver dies?  We hit this problem 
periodically, and we basically have to restart HBase; otherwise all our jobs 
die with errors like this:

org.apache.hadoop.hbase.client.RetriesExhaustedException: Trying to contact region server c1-s35.blablabla.com:60020 for region urlhashv4,F4657B47F9881A42AF88864EC5EA9B27,1307217134729.4fa3defeeaeb59dc56f7ce6f155b2a0b., row 'F471203BA4FF5DD2BD2549308FD81F4A', but failed after 10 attempts.
Exceptions:


Eventually this escalates into a general failure with Wrong Region 
exceptions, and the whole table appears to become corrupt.  The errors at the 
regionserver level are:

2011-06-22 10:32:35,559 WARN org.apache.hadoop.hbase.regionserver.HRegion: File hdfs://c1-m01:54310/hbase/urlhashv4/d3c3f27ac1ce7a2dff35ddf367fe779d/recovered.edits/0000000000097403816 is zero-length, deleting.
2011-06-22 10:32:35,563 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Failed delete of hdfs://c1-m01:54310/hbase/urlhashv4/d3c3f27ac1ce7a2dff35ddf367fe779d/recovered.edits/0000000000097403816
2011-06-22 10:33:19,769 WARN org.apache.hadoop.hbase.regionserver.HRegion: File hdfs://c1-m01:54310/hbase/urlhashv4/9d0d6214bebdefd5466d0e6918c3630c/recovered.edits/0000000000097403669 is zero-length, deleting.
2011-06-22 10:33:19,770 ERROR org.apache.hadoop.hbase.regionserver.HRegion: Failed delete of hdfs://c1-m01:54310/hbase/urlhashv4/9d0d6214bebdefd5466d0e6918c3630c/recovered.edits/0000000000097403669


Shouldn't the master detect regionserver deaths and reassign the affected 
regions to other regionservers?  Or is there a manual way to do this without 
restarting the whole cluster?

Thanks,

Robert Gonzalez
Maxpoint Interactive


