Matthew,

Yes, the master won't start until the log splitting is done. By the
looks of it (such a high number of logs when the max is 32 until the
region server force flushes hlogs), it seems that your region servers
in a previous run weren't able to ensure that each HLog had 3 replicas
even tho dfs.replication was set to that number. Check your previous
region server logs, you should see instances of:

        LOG.warn("HDFS pipeline error detected. " +
            "Found " + numCurrentReplicas + " replicas but expecting " +
            this.initialReplication + " replicas. " +
            " Requesting close of hlog.");

And that probably made 1 log per insert. If your still testing and
your data isn't important, feel free to kill -9 the master and nuke
the HBase HDFS folder or at least the .logs folder. If not, you'll
have to wait. Once it's done, make sure you have a proper
configuration before any other testing.

J-D

On Mon, Aug 23, 2010 at 4:22 PM, Matthew LeMieux <[email protected]> wrote:
> I have a cluster of 3 machines where the NameNode is separate from the 
> HMaster based on the distribution from Cloudera (CDH3).   I have been running 
> it successfully for a couple weeks.   As of this morning, it is completely 
> unusable.  I'm looking for some help on how to fix it.  Details are below.  
> Thank you.
>
> This morning I found HBase to be unresponsive, and tried restarting it.  That 
> didn't help.  For example, running "hbase shell", and then "list" hangs.
>
> The master and region processes start up, but the master does not recognize 
> that the region servers are there.  I am getting the following in master's 
> log file:
>
> 2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
> 2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
> 2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
> 2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
> 2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
> 2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
> 2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
> 2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
> 2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
> region servers, 0 dead, average load NaN
>
>
> Meanwhile, the region servers show this in their log files:
>
> 2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
> connection to server zookeeper:2181
> 2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket 
> connection established to zookeeper:2181, initiating session
> 2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session 
> establishment complete on server zookeeper:2181, sessionid = 
> 0x12aa0cc2520000e, negotiated timeout = 40000
> 2010-08-23 23:05:21,172 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, 
> state: SyncConnected, type: None, path: null
> 2010-08-23 23:05:21,177 DEBUG 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master 
> address ZNode /hbase/master
> 2010-08-23 23:05:21,421 DEBUG 
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master 
> got master:60000
> 2010-08-23 23:05:21,421 INFO 
> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
> master:60000 that we are up
> 2010-08-23 23:05:22,056 INFO 
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook 
> thread: Shutdownhook:regionserver60020
>
> The Region server process is obviously waiting on something:
>
> /tmp/hbaselog$ sudo strace -p7592
> Process 7592 attached - interrupt to quit
> futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL
>
> The Master isn't idle, it appears to be trying to do some sort of recovery 
> having woken up to find 0 region servers.  Here is an excerpt from it:
>
> 2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Splitting hlog 12142 of 143261: 
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435,
>  length=1150
> 2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering 
> filehdfs://namenode:9000/hbase/.log 
> master,60020,1282577331142/master%3A60020.1282581704435
> 2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished 
> lease recover attempt for hdfs://namenode:9000/hbase/.logs 
> master,60020,1282577331142/master%3A60020.1282581704435
> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Pushed=3 entries from 
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/1 
> master%3A60020.1282581704435
> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Splitting hlog 12143 of 143261: 
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451,
>  length=448
> 2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering 
> filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
> 2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished 
> lease recover attempt for 
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Pushed=2 entries from 
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
> Splitting hlog 12144 of 143261: 
> hdfs:/namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468,
>  length=582
> 2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering 
> filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468
>
>
> It looks like the Master is sequentially going through logs up to 143261, 
> having started at 1 and is currently at 12144.   At the current rate, it will 
> take around 12 hours to complete.  Do I have to wait for it to complete 
> before the master will recognize the region servers?  If it doesn't have any 
> region servers, then what the heck is the master doing anyway?
>
> Thank you for your help,
>
> Matthew
>
>
>

Reply via email to