I recently had a region server suicide and have not been able to recover from
it. I've tried completely stopping the entire cluster and restarting it
(including DFS and ZooKeeper), but the master refuses to recognize the region
servers.

The region servers appear to just be waiting for the master with this in their 
log file: 

2010-10-03 17:40:32,748 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: 
<10.249.70.255:/hbase,domU-12-31-39-18-1B-05.compute-1.internal,60020,1286127632413>Read
 ZNode /hbase/master got 10.104.37.247:60000
2010-10-03 17:40:32,749 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
10.104.37.247:60000 that we are up
2010-10-03 17:40:32,862 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
Installed shutdown hook thread: Shutdownhook:regionserver60020

... and the master log file just keeps repeating this:

2010-10-03 17:42:15,531 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-10-03 17:43:15,541 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN

After many lines of this sort of thing: 

2010-10-03 17:41:05,179 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Split writer thread for region 
user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2. got 11 to 
process
2010-10-03 17:41:05,180 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Split writer thread for region 
user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2. Applied 
11 total edits to user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295ae

Followed by many lines of this: 

2010-10-03 17:41:24,719 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Closed 
hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/7b49d357be708d07e6f01843a35286a7/recovered.edits/0000000000075377494
2010-10-03 17:41:24,724 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Closed 
hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/3a58b7adcf049800be83425e75288eeb/recovered.edits/0000000000075377495
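
To check whether the master ever reports a non-zero region server count, a rough sketch like the one below could scan the master log for the ServerManager status lines. It assumes each status entry sits on a single line in the real log file (the excerpts in this message are wrapped by the mail client) and that the format matches the lines quoted above. The function names are made up for illustration.

```python
import re

# Matches the ServerManager status line quoted in this message, e.g.
# "2010-10-03 17:42:15,531 INFO org.apache.hadoop.hbase.master.ServerManager: "
# "0 region servers, 0 dead, average load NaN"
# Assumption: other HBase versions may format this line differently.
STATUS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ INFO "
    r"org\.apache\.hadoop\.hbase\.master\.ServerManager: "
    r"(?P<live>\d+) region servers, (?P<dead>\d+) dead"
)


def server_counts(lines):
    """Yield (timestamp, live, dead) for each ServerManager status line."""
    for line in lines:
        match = STATUS_RE.match(line)
        if match:
            yield match.group("ts"), int(match.group("live")), int(match.group("dead"))


def first_checkin(lines):
    """Return the timestamp of the first status line with live > 0, else None."""
    for ts, live, _dead in server_counts(lines):
        if live > 0:
            return ts
    return None


if __name__ == "__main__":
    # The two status lines from the master log above, unwrapped.
    sample = [
        "2010-10-03 17:42:15,531 INFO org.apache.hadoop.hbase.master.ServerManager: "
        "0 region servers, 0 dead, average load NaN",
        "2010-10-03 17:43:15,541 INFO org.apache.hadoop.hbase.master.ServerManager: "
        "0 region servers, 0 dead, average load NaN",
    ]
    for ts, live, dead in server_counts(sample):
        print(ts, live, dead)
    print("first check-in:", first_checkin(sample))
```

Passing an open file handle for the master log instead of `sample` should work the same way, since both functions only iterate over lines.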

As one might expect, attempts to access hbase hang, for example:

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.89.20100924, r1001068, Fri Sep 24 13:55:42 PDT 2010

hbase(main):001:0> list
TABLE 


I'm using CDH3b2 for HDFS and the version of HBase from here:
http://people.apache.org/~jdcryans/hbase-0.89.20100924-candidate-1

Any ideas on how I can get the master to recognize the region servers? My
main concern is just getting back up and running.

Thank you

Matthew
