It's all about this line: "for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client 10.14.98.90, because this file is already being created by NN_Recovery"
I'm not really sure why that happens, I've seen it on my test
clusters, and basically this holds up region redeployment, hence your
problems. Perhaps someone familiar with the deep internals of append
recovery can speak up...

-ryan

On Tue, Jan 25, 2011 at 4:02 PM, Bill Graham <[email protected]> wrote:
> I'm still not sure how I got into this situation, but I've gotten
> myself out of it and I'm up and running.
>
> The fix was to shut down the cluster and remove the .logs/ files from
> HDFS. Then the master was able to start properly and a regionserver
> was able to start up and serve the -ROOT- region.
>
> One theory as to the cause of this issue (twice now) is that I was
> still getting bitten by the issue of invalid Hadoop Maven jars in my
> classpath (see https://issues.apache.org/jira/browse/HBASE-3436) on 2
> of my 4 regionservers. I'll add more commentary around HBASE-3436 in
> the JIRA.
>
>
> On Tue, Jan 25, 2011 at 3:27 PM, Bill Graham <[email protected]> wrote:
>> Hi,
>>
>> A developer on our team created a table today, something failed, and
>> we fell back into the dire scenario we were in earlier this week.
>> When I got on the scene, 2 of our 4 regionservers had crashed. When I
>> brought them back up, they wouldn't come online and the master was
>> scrolling messages like those in
>> https://issues.apache.org/jira/browse/HBASE-3406.
>>
>> I'm running 0.90.0-rc1 and CDH3b2 with append enabled.
>>
>> I shut down the entire cluster + ZooKeeper and restarted it. Now I'm
>> getting two types of errors and the cluster won't come up:
>>
>> - On one of the regionservers:
>> 2011-01-25 15:12:00,287 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> NotServingRegionException; Region is not online: -ROOT-,,0
>>
>> - And on the master this scrolls every few seconds (the log file
>> referenced is empty in HDFS):
>> 2011-01-25 15:12:26,897 WARN org.apache.hadoop.hbase.util.FSUtils:
>> Waited 275444ms for lease recovery on
>> hdfs://mymaster.com:9000/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592:
>> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>> failed to create file
>> /hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
>> for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
>> 10.14.98.90, because this file is already being created by NN_Recovery
>> on 10.10.220.15
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1093)
>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>         at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>         at java.security.AccessController.doPrivileged(Native Method)
>>         at javax.security.auth.Subject.doAs(Subject.java:396)
>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>
>> Any suggestions for how to get -ROOT- back? I can see it in HDFS.
>>
>> thanks,
>> Bill
>>
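For anyone hitting the same wall, here is roughly what that FSUtils
warning corresponds to on the master side. This is an illustrative
sketch only, not the actual HBase code, and the class and method names
are made up for the example: the master keeps trying to re-open the
dead regionserver's WAL for append, and as long as the NameNode's lease
recovery (the NN_Recovery holder in the message) still owns the file,
every attempt fails with AlreadyBeingCreatedException.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException;
    import org.apache.hadoop.ipc.RemoteException;

    public class WalLeaseWait {

      // Keep trying to re-open a dead regionserver's WAL for append until the
      // NameNode has finished recovering the previous writer's lease. While the
      // NN_Recovery holder still owns the file, every append() attempt fails
      // with AlreadyBeingCreatedException, which is the exception scrolling in
      // the master log above.
      public static void waitForLeaseRecovery(FileSystem fs, Path wal)
          throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        while (true) {
          try {
            fs.append(wal).close();  // lease is free: the file can be taken over
            return;
          } catch (IOException e) {
            IOException cause = e;
            if (e instanceof RemoteException) {
              cause = ((RemoteException) e).unwrapRemoteException();
            }
            if (!(cause instanceof AlreadyBeingCreatedException)) {
              throw cause;  // some unrelated failure, give up
            }
            System.out.println("Waited " + (System.currentTimeMillis() - start)
                + "ms for lease recovery on " + wal);
            Thread.sleep(1000);  // lease still held by NN_Recovery, retry
          }
        }
      }
    }

If the lease never gets recovered (e.g. because the regionserver's
DataNode-side recovery is wedged), that loop is what spins forever and
blocks the master, which is why clearing the stale WALs unsticks things.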

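And roughly what Bill's manual cleanup amounts to, expressed through
the FileSystem API rather than the hadoop fs shell. Again just a
sketch, assuming the hbase.rootdir (/hbase-app/hbase) that shows up in
the master log above; adjust the path for your own cluster.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemoveStaleWalDir {
      public static void main(String[] args) throws Exception {
        // Equivalent of "hadoop fs -rmr /hbase-app/hbase/.logs". Run this only
        // with HBase completely shut down; any edits still sitting in these
        // WALs that were never flushed to HFiles are gone for good.
        Configuration conf = new Configuration();  // picks up fs.default.name from core-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path logsDir = new Path("/hbase-app/hbase/.logs");  // rootdir taken from the logs above
        if (fs.exists(logsDir)) {
          fs.delete(logsDir, true);  // recursive delete of all per-regionserver log dirs
        }
        fs.close();
      }
    }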