I've recently had a region server suicide, and am not able to recover from it. I've tried completely stopping the entire cluster and restarting it (including dfs and zk), but the master refuses to recognize the regionservers.
The region servers appear to just be waiting for the master, with this in their log file:

2010-10-03 17:40:32,748 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: <10.249.70.255:/hbase,domU-12-31-39-18-1B-05.compute-1.internal,60020,1286127632413>Read ZNode /hbase/master got 10.104.37.247:60000
2010-10-03 17:40:32,749 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 10.104.37.247:60000 that we are up
2010-10-03 17:40:32,862 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020

... and the master log file just keeps repeating this:

2010-10-03 17:42:15,531 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-10-03 17:43:15,541 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN

That comes after many lines of this sort of thing:

2010-10-03 17:41:05,179 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Split writer thread for region user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2. got 11 to process
2010-10-03 17:41:05,180 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Split writer thread for region user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2. Applied 11 total edits to user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295ae

Followed by many lines of this:

2010-10-03 17:41:24,719 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/7b49d357be708d07e6f01843a35286a7/recovered.edits/0000000000075377494
2010-10-03 17:41:24,724 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Closed hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/3a58b7adcf049800be83425e75288eeb/recovered.edits/0000000000075377495

As one might expect, attempts to access HBase hang. For example, the shell never returns from a list command:

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.89.20100924, r1001068, Fri Sep 24 13:55:42 PDT 2010

hbase(main):001:0> list
TABLE

I'm using CDH3b2 for HDFS and the version of HBase from here:
http://people.apache.org/~jdcryans/hbase-0.89.20100924-candidate-1

Any ideas on how I can get the master to recognize the region servers? I'm really just concerned with how to get back up and running.

Thank you
Matthew
