I'm still not sure how I got into this situation, but I've gotten myself out of it and I'm up and running.
The fix was to shut down the cluster and remove the .logs/ files from
HDFS. Then the master was able to start properly and a regionserver
was able to start up and serve the -ROOT- region.

One theory as to the cause of this issue (twice now) is that I was
still getting bitten by the issue of invalid hadoop maven jars in my
classpath (see https://issues.apache.org/jira/browse/HBASE-3436) on 2
of my 4 regionservers. I'll add more commentary around HBASE-3436 in
the JIRA.

(A rough sketch of that .logs/ cleanup is included after the quoted
message below.)

On Tue, Jan 25, 2011 at 3:27 PM, Bill Graham <[email protected]> wrote:
> Hi,
>
> A developer on our team created a table today and something failed and
> we fell back into the dire scenario we were in earlier this week. When
> I got on the scene 2 of our 4 regionservers had crashed. When I brought
> them back up, they wouldn't come online and the master was scrolling
> messages like those in https://issues.apache.org/jira/browse/HBASE-3406.
>
> I'm running 0.90.0-rc1 and CDH3b2 with append enabled.
>
> I shut down the entire cluster + zookeeper and restarted it. Now I'm
> getting two types of errors and the cluster won't come up:
>
> - On one of the regionservers:
> 2011-01-25 15:12:00,287 DEBUG
> org.apache.hadoop.hbase.regionserver.HRegionServer:
> NotServingRegionException; Region is not online: -ROOT-,,0
>
> - And on the master this scrolls every few seconds. The log file
> referenced is empty in HDFS.
> 2011-01-25 15:12:26,897 WARN org.apache.hadoop.hbase.util.FSUtils:
> Waited 275444ms for lease recovery on
> hdfs://mymaster.com:9000/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> failed to create file
> /hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
> for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
> 10.14.98.90, because this file is already being created by NN_Recovery
> on 10.10.220.15
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1093)
>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>         at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>         at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>
> Any suggestions for how to get the -ROOT- back? I can see it in HDFS.
>
> thanks,
> Bill
>
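For anyone who hits the same thing, here's roughly what that cleanup
amounts to, written against the Hadoop FileSystem API. This is just a
sketch, not exactly what I ran: the ClearStuckWals class name is made
up, the /hbase-app/hbase/.logs path is the one from the error above,
and you'd substitute your own hbase.rootdir. Deleting WALs throws away
any edits that haven't been flushed to HFiles, so only do this with
HBase fully stopped and only if you can live with that.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical one-off cleanup: remove the per-regionserver WAL dirs under
// .logs so the master stops waiting on lease recovery for the stuck file.
public class ClearStuckWals {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml/hdfs-site.xml from the classpath, so this
    // talks to the same namenode as the cluster.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // hbase.rootdir + "/.logs" -- adjust for your cluster.
    Path logsDir = new Path("/hbase-app/hbase/.logs");
    FileStatus[] serverDirs = fs.listStatus(logsDir);
    if (serverDirs != null) {
      for (FileStatus server : serverDirs) {
        System.out.println("deleting " + server.getPath());
        // Recursive delete of each regionserver's log directory.
        fs.delete(server.getPath(), true);
      }
    }
    fs.close();
  }
}

From the command line, "hadoop fs -rmr" on that .logs directory gets you
to the same place; the point is just that the master won't come up while
it's blocked recovering the lease on those files.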
