These JIRAs might be related:
https://issues.apache.org/jira/browse/HDFS-1520
https://issues.apache.org/jira/browse/HDFS-1554

I'm not sure they would help in this situation, since the client
'NN_Recovery' isn't a "real" client (i.e. an HBase regionserver).

On Tue, Jan 25, 2011 at 6:59 PM, Ryan Rawson <[email protected]> wrote:
> It's all about this line:
>
> "for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
> 10.14.98.90, because this file is already being created by NN_Recovery"
>
> I'm not really sure why that happens; I've seen it on my test
> clusters, and it basically holds up region redeployment, hence your
> problems.
>
> Perhaps someone familiar with the deep internals of append recovery
> can speak up...
>
> -ryan
>
>
> On Tue, Jan 25, 2011 at 4:02 PM, Bill Graham <[email protected]> wrote:
>> I'm still not sure how I got into this situation, but I've gotten
>> myself out of it and I'm up and running.
>>
>> The fix was to shut down the cluster and remove the .logs/ files from
>> HDFS. Then the master was able to start properly and a regionserver
>> was able to start up and serve the -ROOT- region.
>>
>> One theory as to the cause of this issue (twice now) is that I was
>> still getting bitten by the issue of invalid Hadoop Maven jars in my
>> classpath (see https://issues.apache.org/jira/browse/HBASE-3436) on 2
>> of my 4 regionservers. I'll add more commentary around HBASE-3436 in
>> the JIRA.
>>
>>
>>
>> On Tue, Jan 25, 2011 at 3:27 PM, Bill Graham <[email protected]> wrote:
>>> Hi,
>>>
>>> A developer on our team created a table today; something failed and
>>> we fell back into the dire scenario we were in earlier this week.
>>> When I got on the scene, 2 of our 4 regionservers had crashed. When I
>>> brought them back up, they wouldn't come online and the master was
>>> scrolling messages like those in
>>> https://issues.apache.org/jira/browse/HBASE-3406.
>>>
>>> I'm running 0.90.0-rc1 and CDH3b2 with append enabled.
>>>
>>> I shut down the entire cluster + ZooKeeper and restarted it.
>>> Now I'm getting two types of errors and the cluster won't come up:
>>>
>>> - On one of the regionservers:
>>> 2011-01-25 15:12:00,287 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer:
>>> NotServingRegionException; Region is not online: -ROOT-,,0
>>>
>>> - And on the master this scrolls every few seconds. The log file
>>> referenced is empty in HDFS.
>>> 2011-01-25 15:12:26,897 WARN org.apache.hadoop.hbase.util.FSUtils:
>>> Waited 275444ms for lease recovery on
>>> hdfs://mymaster.com:9000/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592:
>>> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>>> failed to create file
>>> /hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
>>> for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
>>> 10.14.98.90, because this file is already being created by NN_Recovery
>>> on 10.10.220.15
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1093)
>>>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>>   at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>>   at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>   at java.lang.reflect.Method.invoke(Method.java:597)
>>>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>>   at java.security.AccessController.doPrivileged(Native Method)
>>>   at javax.security.auth.Subject.doAs(Subject.java:396)
>>>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>>
>>> Any suggestions for how to get the -ROOT- back? I can see it in HDFS.
>>>
>>> thanks,
>>> Bill
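[Editor's note] The workaround Bill describes above (stop the whole cluster, clear the write-ahead logs, restart) can be sketched as a shell dry run. The `HBASE_ROOT` path is taken from the log messages in the thread, and `stop-hbase.sh`/`start-hbase.sh` are the stock HBase scripts; these are assumptions about this particular cluster, not verified commands. Note that removing `.logs` discards any edits not yet flushed to HFiles, so this is a last resort, not a recommended procedure. The commands are echoed rather than executed precisely because they are destructive:

```shell
# Dry-run sketch of the workaround from the thread. Each planned command
# is recorded and echoed with a "+ " prefix instead of being executed.
HBASE_ROOT=/hbase-app/hbase        # HBase root dir, from the log paths above

CMDS=""
plan() { CMDS="$CMDS$*"$'\n'; echo "+ $*"; }   # swap 'echo' for real execution

plan bin/stop-hbase.sh                      # full stop: master + all regionservers
plan hadoop fs -rmr "$HBASE_ROOT/.logs"     # drop the stale WALs (data-loss risk!)
plan bin/start-hbase.sh                     # restart; master re-creates .logs
```

On restart the master no longer waits on lease recovery for the deleted log files, which is why the -ROOT- region could be redeployed.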
