I have a cluster of 3 machines, based on the distribution from Cloudera (CDH3), where the NameNode is separate from the HMaster. I have been running it successfully for a couple of weeks. As of this morning, it is completely unusable. I'm looking for some help on how to fix it. Details are below. Thank you.
This morning I found HBase to be unresponsive, and tried restarting it. That didn't help. For example, running "hbase shell", and then "list" hangs. The master and region processes start up, but the master does not recognize that the region servers are there. I am getting the following in master's log file:

2010-08-23 23:04:16,100 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:05:16,110 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:06:16,120 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:07:16,130 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:08:16,140 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:09:16,146 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:10:16,150 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:11:16,160 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN
2010-08-23 23:12:16,170 INFO org.apache.hadoop.hbase.master.ServerManager: 0 region servers, 0 dead, average load NaN

Meanwhile, the region servers show this in their log files:

2010-08-23 23:05:21,006 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server zookeeper:2181
2010-08-23 23:05:21,028 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to zookeeper:2181, initiating session
2010-08-23 23:05:21,168 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server zookeeper:2181, sessionid = 0x12aa0cc2520000e, negotiated timeout = 40000
2010-08-23 23:05:21,172 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Got ZooKeeper event, state: SyncConnected, type: None, path: null
2010-08-23 23:05:21,177 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Set watcher on master address ZNode /hbase/master
2010-08-23 23:05:21,421 DEBUG org.apache.hadoop.hbase.zookeeper.ZooKeeperWrapper: Read ZNode /hbase/master got master:60000
2010-08-23 23:05:21,421 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at master:60000 that we are up
2010-08-23 23:05:22,056 INFO org.apache.hadoop.hbase.regionserver.ShutdownHook: Installed shutdown hook thread: Shutdownhook:regionserver60020

The Region server process is obviously waiting on something:

/tmp/hbaselog$ sudo strace -p7592
Process 7592 attached - interrupt to quit
futex(0x7f65534739e0, FUTEX_WAIT, 7602, NULL

The Master isn't idle; it appears to be trying to do some sort of recovery, having woken up to find 0 region servers. Here is an excerpt from its log:

2010-08-23 23:10:06,290 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12142 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435, length=1150
2010-08-23 23:10:06,290 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,510 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=3 entries from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704435
2010-08-23 23:10:06,513 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12143 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451, length=448
2010-08-23 23:10:06,513 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,721 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Pushed=2 entries from hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704451
2010-08-23 23:10:06,723 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLog: Splitting hlog 12144 of 143261: hdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468, length=582
2010-08-23 23:10:06,723 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering filehdfs://namenode:9000/hbase/.logs/master,60020,1282577331142/master%3A60020.1282581704468

It looks like the Master is going through the logs sequentially, from 1 up to 143261, and is currently at 12144. At the current rate, it will take around 12 hours to complete. Do I have to wait for it to complete before the master will recognize the region servers? If it doesn't have any region servers, then what the heck is the master doing anyway?

Thank you for your help,
Matthew
