From the brief looks of it, it seems that the master is splitting the
log files from the dead region server. It will do that while the
cluster is running and will keep answering the other region servers,
but if you restart HBase then when the master starts it will split
everything before starting to take region server checkins. Just let
the master finish its job. Look for this message that tells you which
region server's hlogs are being split:
LOG.info("Splitting " + logfiles.length + " hlog(s) in " + srcDir.toString());
Then this message will show when it's done:
LOG.info("hlog file splitting completed in " + (endMillis - millis) +
" millis for " + srcDir.toString());
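
If you would rather not eyeball the master log, here is a minimal sketch that pairs up those two messages and prints the directories whose split has started but not finished. It assumes each message sits on a single line of the log and that the directory path contains no spaces; the class name SplitProgress and the log path you pass in are just placeholders, not anything that ships with HBase.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase.
public class SplitProgress {
    // These patterns mirror the two LOG.info messages quoted above.
    private static final Pattern STARTED =
        Pattern.compile("Splitting (\\d+) hlog\\(s\\) in (\\S+)");
    private static final Pattern FINISHED =
        Pattern.compile("hlog file splitting completed in (\\d+) millis for (\\S+)");

    /** Returns source directories whose split has started but not yet completed. */
    public static Set<String> pendingSplits(List<String> logLines) {
        Set<String> pending = new LinkedHashSet<>();
        for (String line : logLines) {
            Matcher finished = FINISHED.matcher(line);
            if (finished.find()) {
                // Completion message: this directory is done.
                pending.remove(finished.group(2));
                continue;
            }
            Matcher started = STARTED.matcher(line);
            if (started.find()) {
                // Start message: remember the directory until it completes.
                pending.add(started.group(2));
            }
        }
        return pending;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java SplitProgress <master-log-file>");
            return;
        }
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Set<String> pending = pendingSplits(lines);
        if (pending.isEmpty()) {
            System.out.println("No hlog splits in progress.");
        } else {
            for (String dir : pending) {
                System.out.println("Still splitting: " + dir);
            }
        }
    }
}
```

Run it against the master's log file; an empty result suggests the master has finished splitting and should start taking region server checkins.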
J-D
On Sun, Oct 3, 2010 at 10:56 AM, Matthew LeMieux <[email protected]> wrote:
> I've recently had a region server suicide, and am not able to recover from
> it. I've tried completely stopping the entire cluster and restarting it
> (including dfs and zk), but the master refuses to recognize the regionservers.
>
> The region servers appear to just be waiting for the master with this in
> their log file:
>
> 2010-10-03 17:40:32,748 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper:
> <10.249.70.255:/hbase,domU-12-31-39-18-1B-05.compute-1.internal,60020,1286127632413>Read
> ZNode /hbase/master got 10.104.37.247:60000
> 2010-10-03 17:40:32,749 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
> 10.104.37.247:60000 that we are up
> 2010-10-03 17:40:32,862 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook
> thread: Shutdownhook:regionserver60020
>
> ... and the master log file just keeps repeating this:
>
> 2010-10-03 17:42:15,531 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-10-03 17:43:15,541 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
>
> After many lines of this sort of thing:
>
> 2010-10-03 17:41:05,179 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Split writer thread for region
> user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2. got 11
> to process
> 2010-10-03 17:41:05,180 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Split writer thread for region
> user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295aeec3a7e739fe2. Applied
> 11 total edits to user,\x01\x88\xFB\xCA,1281914437530.3901f9eb92c049a295ae
>
> Followed by many lines of this:
>
> 2010-10-03 17:41:24,719 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Closed
> hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/7b49d357be708d07e6f01843a35286a7/recovered.edits/0000000000075377494
> 2010-10-03 17:41:24,724 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Closed
> hdfs://domU-12-31-39-03-44-F1.compute-1.internal:9000/hbase/user/3a58b7adcf049800be83425e75288eeb/recovered.edits/0000000000075377495
>
> As one might expect, attempts to access hbase hang, for example:
>
> HBase Shell; enter 'help<RETURN>' for list of supported commands.
> Type "exit<RETURN>" to leave the HBase Shell
> Version: 0.89.20100924, r1001068, Fri Sep 24 13:55:42 PDT 2010
>
> hbase(main):001:0> list
> TABLE
>
>
> I'm using CDH3b2 for hdfs and the version of hbase from here:
> http://people.apache.org/~jdcryans/hbase-0.89.20100924-candidate-1
>
> Any ideas on how I can get the master to recognize the region servers? I'm
> really just concerned with how to get back up and running.
>
> Thank you
>
> Matthew
>
>