Matthew,
Yes, the master won't start until the log splitting is done. By the
looks of it (such a high number of logs when the max is 32 until the
region server force flushes hlogs), it seems that your region servers
in a previous run weren't able to ensure that each HLog had 3 replicas
even tho dfs.replication was set to that number. Check your previous
region server logs, you should see instances of:
LOG.warn("HDFS pipeline error detected. " +
"Found " + numCurrentReplicas + " replicas but expecting " +
this.initialReplication + " replicas. " +
" Requesting close of hlog.");
And that probably made 1 log per insert. If your still testing and
your data isn't important, feel free to kill -9 the master and nuke
the HBase HDFS folder or at least the .logs folder. If not, you'll
have to wait. Once it's done, make sure you have a proper
configuration before any other testing.
J-D
On Mon, Aug 23, 2010 at 4:22 PM, Matthew LeMieux <[email protected]> wrote:
> I have a cluster of 3 machines where the NameNode is separate from the
> HMaster based on the distribution from Cloudera (CDH3). I have been running
> it successfully for a couple weeks. As of this morning, it is completely
> unusable. I'm looking for some help on how to fix it. Details are below.
> Thank you.
>
> This morning I found HBase to be unresponsive, and tried restarting it. That
> didn't help. For example, running "hbase shell", and then "list" hangs.
>
> The master and region processes start up, but the master does not recognize
> that the region servers are there. I am getting the following in master's
> log file:
>
> 2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
> 2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0
> region servers, 0 dead, average load NaN
>
>
> Meanwhile, the region servers show this in their log files:
>
> 2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket
> connection to server zookeeper:2181
> 2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to zookeeper:2181, initiating session
> 2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session
> establishment complete on server zookeeper:2181, sessionid =
> 0x12aa0cc2520000e, negotiated timeout = 40000
> 2010-08-23 23:05:21,172 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
> state: SyncConnected, type: None, path: null
> 2010-08-23 23:05:21,177 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master
> address ZNode /hbase/master
> 2010-08-23 23:05:21,421 DEBUG
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master
> got master:60000
> 2010-08-23 23:05:21,421 INFO
> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
> master:60000 that we are up
> 2010-08-23 23:05:22,056 INFO
> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook
> thread: Shutdownhook:regionserver60020
>
> The Region server process is obviously waiting on something:
>
> /tmp/hbaselog$ sudo strace -p7592
> Process 7592 attached - interrupt to quit
> futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL
>
> The Master isn't idle, it appears to be trying to do some sort of recovery
> having woken up to find 0 region servers. Here is an excerpt from it:
>
> 2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Splitting hlog 12142 of 143261:
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435,
> length=1150
> 2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering
> filehdfs://namenode:9000/hbase/.log
> master,60020,1282577331142/master%3A60020.1282581704435
> 2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished
> lease recover attempt for hdfs://namenode:9000/hbase/.logs
> master,60020,1282577331142/master%3A60020.1282581704435
> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Pushed=3 entries from
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/1
> master%3A60020.1282581704435
> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Splitting hlog 12143 of 143261:
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451,
> length=448
> 2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering
> filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
> 2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished
> lease recover attempt for
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Pushed=2 entries from
> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
> Splitting hlog 12144 of 143261:
> hdfs:/namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468,
> length=582
> 2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering
> filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468
>
>
> It looks like the Master is sequentially going through logs up to 143261,
> having started at 1 and is currently at 12144. At the current rate, it will
> take around 12 hours to complete. Do I have to wait for it to complete
> before the master will recognize the region servers? If it doesn't have any
> region servers, then what the heck is the master doing anyway?
>
> Thank you for your help,
>
> Matthew
>
>
>