It's all about this line:

"for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
10.14.98.90, because this file is already being created by NN_Recovery"

I'm not really sure why that happens. I've seen it on my test
clusters, and it holds up region redeployment, hence your
problems.

Perhaps someone familiar with the deep internals of append recovery
can speak up...

-ryan


On Tue, Jan 25, 2011 at 4:02 PM, Bill Graham <[email protected]> wrote:
> I'm still not sure how I got into this situation, but I've gotten
> myself out of it and I'm up and running.
>
> The fix was to shut down the cluster and remove the .logs/ files from
> HDFS. Then the master was able to start properly and a regionserver
> was able to start up and serve the -ROOT- region.
>
> One theory as to the cause of this issue (twice now) is that I was
> still getting bitten by the issue of invalid hadoop maven jars in my
> classpath (see https://issues.apache.org/jira/browse/HBASE-3436) on 2
> of my 4 regionservers. I'll add more commentary around HBASE-3436 in
> the JIRA.
>
>
>
> On Tue, Jan 25, 2011 at 3:27 PM, Bill Graham <[email protected]> wrote:
>> Hi,
>>
>> A developer on our team created a table today and something failed and
>> we fell back into the dire scenario we were in earlier this week. When
>> I got on the scene 2 of our 4 regionservers had crashed. When I brought them
>> back up, they wouldn't come online and the master was scrolling
>> messages like those in
>> https://issues.apache.org/jira/browse/HBASE-3406.
>>
>> I'm running 0.90.0-rc1 and CDH3b2 with append enabled.
>>
>> I shut down the entire cluster + zookeeper and restarted it. Now, I'm
>> getting two types of errors and the cluster won't come up:
>>
>> - On one of the regionservers:
>> 2011-01-25 15:12:00,287 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>> NotServingRegionException; Region is not online: -ROOT-,,0
>>
>> - And on the master this scrolls every few seconds. The log file
>> referenced is empty in HDFS.
>> 2011-01-25 15:12:26,897 WARN org.apache.hadoop.hbase.util.FSUtils:
>> Waited 275444ms for lease recovery on
>> hdfs://mymaster.com:9000/hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>> failed to create file
>> /hbase-app/hbase/.logs/hadoop-wkr-r14-n1.mydomain.com,60020,1295900457489/hadoop-wkr-r14-n1.mydomain.com%3A60020.1295907659592
>> for DFSClient_hb_m_mymaster.com:60000_1295996847777 on client
>> 10.14.98.90, because this file is already being created by NN_Recovery
>> on 10.10.220.15
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1093)
>>        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>        at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>        at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>        at java.security.AccessController.doPrivileged(Native Method)
>>        at javax.security.auth.Subject.doAs(Subject.java:396)
>>        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>
>> Any suggestions for how to get the -ROOT- back? I can see it in HDFS.
>>
>> thanks,
>> Bill
>>
>
