I have a cluster of 3 machines where the NameNode is separate from the HMaster 
based on the distribution from Cloudera (CDH3).   I have been running it 
successfully for a couple weeks.   As of this morning, it is completely 
unusable.  I'm looking for some help on how to fix it.  Details are below.  
Thank you. 

This morning I found HBase to be unresponsive, and tried restarting it.  That 
didn't help.  For example, running "hbase shell", and then "list" hangs. 

The master and region processes start up, but the master does not recognize 
that the region servers are there.  I am getting the following in master's log 
file: 

2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN
2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0 
region servers, 0 dead, average load NaN


Meanwhile, the region servers show this in their log files: 

2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server zookeeper:2181
2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to zookeeper:2181, initiating session
2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server zookeeper:2181, sessionid = 0x12aa0cc2520000e, 
negotiated timeout = 40000
2010-08-23 23:05:21,172 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: 
SyncConnected, type: None, path: null
2010-08-23 23:05:21,177 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master 
address ZNode /hbase/master
2010-08-23 23:05:21,421 DEBUG 
org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master 
got master:60000
2010-08-23 23:05:21,421 INFO 
org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
master:60000 that we are up
2010-08-23 23:05:22,056 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: 
Installed shutdown hook thread: Shutdownhook:regionserver60020

The Region server process is obviously waiting on something: 

/tmp/hbaselog$ sudo strace -p7592
Process 7592 attached - interrupt to quit
futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL

The Master isn't idle, it appears to be trying to do some sort of recovery 
having woken up to find 0 region servers.  Here is an excerpt from it: 

2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Splitting hlog 12142 of 143261: 
hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435,
 length=1150
2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering 
filehdfs://namenode:9000/hbase/.log 
master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished 
lease recover attempt for hdfs://namenode:9000/hbase/.logs 
master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Pushed=3 entries from 
hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/1 
master%3A60020.1282581704435
2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Splitting hlog 12143 of 143261: 
hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451,
 length=448
2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering 
filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished 
lease recover attempt for 
hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Pushed=2 entries from 
hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: 
Splitting hlog 12144 of 143261: 
hdfs:/namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468,
 length=582
2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering 
filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468


It looks like the Master is sequentially going through logs up to 143261, 
having started at 1 and is currently at 12144.   At the current rate, it will 
take around 12 hours to complete.  Do I have to wait for it to complete before 
the master will recognize the region servers?  If it doesn't have any region 
servers, then what the heck is the master doing anyway?   

Thank you for your help, 

Matthew


Reply via email to