J-D,
Thank you for the very fast response. I'm running in the Amazon cloud,
and would like to be able to expand the cluster and reduce the cluster quickly
to support hadoop jobs, so I've been playing with the replication factor and
number of machines.
If I add 5 more machines for a job I'm running over night that needs some
extra power, then removing those machines the next morning becomes a tedious
task because some of the data generated resides perhaps exclusively on the 5
new machines. Since the HDFS node decommission process takes so long, I've
been putting the replication factor high, and removing machines incrementally.
For the most part, it works great because the replication speed is much faster
than the decommission speed. Hadoop M/R jobs and HDFS seem completely content
to deal with nodes coming and going as long as there are at least 1 replica
from which to replicate. Since the Hadoop M/R jobs pull data from HBase, do
some processing, and then put the data back into HBase I assumed that hosting
the Hadoop M/R and HBase on the same HDFS would be faster than copying the data
to a new transient cluster.
Is there a way to tell HBase that I want the replication factor to be
high, but not to panic (i.e., go crazy with logs) if it isn't actually at that
number? Sort of like a soft limit and a hard limit...
Thank you,
Matthew
P.S. I do need the data, I wish there was a way to speed this up. I didn't
find any evidence of a crash, just a looooong recovery.
On Aug 23, 2010, at 4:30 PM, Jean-Daniel Cryans wrote:
> Matthew,
>
> Yes, the master won't start until the log splitting is done. By the
> looks of it (such a high number of logs when the max is 32 until the
> region server force flushes hlogs), it seems that your region servers
> in a previous run weren't able to ensure that each HLog had 3 replicas
> even tho dfs.replication was set to that number. Check your previous
> region server logs, you should see instances of:
>
> LOG.warn("HDFS pipeline error detected. " +
> "Found " + numCurrentReplicas + " replicas but expecting " +
> this.initialReplication + " replicas. " +
> " Requesting close of hlog.");
>
> And that probably made 1 log per insert. If your still testing and
> your data isn't important, feel free to kill -9 the master and nuke
> the HBase HDFS folder or at least the .logs folder. If not, you'll
> have to wait. Once it's done, make sure you have a proper
> configuration before any other testing.
>
> J-D
>
> On Mon, Aug 23, 2010 at 4:22 PM, Matthew LeMieux <[email protected]> wrote:
>> I have a cluster of 3 machines where the NameNode is separate from the
>> HMaster based on the distribution from Cloudera (CDH3). I have been
>> running it successfully for a couple weeks. As of this morning, it is
>> completely unusable. I'm looking for some help on how to fix it. Details
>> are below. Thank you.
>>
>> This morning I found HBase to be unresponsive, and tried restarting it.
>> That didn't help. For example, running "hbase shell", and then "list" hangs.
>>
>> The master and region processes start up, but the master does not recognize
>> that the region servers are there. I am getting the following in master's
>> log file:
>>
>> 2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>> 2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>> 2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>> 2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>> 2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>> 2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>> 2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>> 2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>> 2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0
>> region servers, 0 dead, average load NaN
>>
>>
>> Meanwhile, the region servers show this in their log files:
>>
>> 2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket
>> connection to server zookeeper:2181
>> 2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket
>> connection established to zookeeper:2181, initiating session
>> 2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session
>> establishment complete on server zookeeper:2181, sessionid =
>> 0x12aa0cc2520000e, negotiated timeout = 40000
>> 2010-08-23 23:05:21,172 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event,
>> state: SyncConnected, type: None, path: null
>> 2010-08-23 23:05:21,177 DEBUG
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master
>> address ZNode /hbase/master
>> 2010-08-23 23:05:21,421 DEBUG
>> org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master
>> got master:60000
>> 2010-08-23 23:05:21,421 INFO
>> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
>> master:60000 that we are up
>> 2010-08-23 23:05:22,056 INFO
>> org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook
>> thread: Shutdownhook:regionserver60020
>>
>> The Region server process is obviously waiting on something:
>>
>> /tmp/hbaselog$ sudo strace -p7592
>> Process 7592 attached - interrupt to quit
>> futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL
>>
>> The Master isn't idle, it appears to be trying to do some sort of recovery
>> having woken up to find 0 region servers. Here is an excerpt from it:
>>
>> 2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Splitting hlog 12142 of 143261:
>> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435,
>> length=1150
>> 2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils:
>> Recovering filehdfs://namenode:9000/hbase/.log
>> master,60020,1282577331142/master%3A60020.1282581704435
>> 2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished
>> lease recover attempt for hdfs://namenode:9000/hbase/.logs
>> master,60020,1282577331142/master%3A60020.1282581704435
>> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Pushed=3 entries from
>> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/1
>> master%3A60020.1282581704435
>> 2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Splitting hlog 12143 of 143261:
>> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451,
>> length=448
>> 2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils:
>> Recovering
>> filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>> 2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished
>> lease recover attempt for
>> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Pushed=2 entries from
>> hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
>> 2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog:
>> Splitting hlog 12144 of 143261:
>> hdfs:/namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468,
>> length=582
>> 2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils:
>> Recovering
>> filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468
>>
>>
>> It looks like the Master is sequentially going through logs up to 143261,
>> having started at 1 and is currently at 12144. At the current rate, it
>> will take around 12 hours to complete. Do I have to wait for it to complete
>> before the master will recognize the region servers? If it doesn't have any
>> region servers, then what the heck is the master doing anyway?
>>
>> Thank you for your help,
>>
>> Matthew
>>
>>
>>