Re: HBase crash, need help getting back up

Stack Wed, 08 Sep 2010 22:01:00 -0700

recovered.edits is the name of the file produced when WAL logs are
split; one is made per region.


Where are you seeing that message?  Does it not have the full path to the
recovered.edits file?

Are you running with permissions enabled on this cluster?

Why did the regionservers go down?
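
For reading the AccessControlException quoted below, here is a minimal Python sketch that splits the message into its fields and checks the owner permission bits. The names DENIED, parse_denied, owner_allows and SAMPLE are made up for illustration and are not part of HBase or Hadoop; the regex is only fitted to the one message in this thread.

```python
import re

# Pattern fitted to the single "Permission denied" message in this thread.
DENIED = re.compile(
    r'user=(?P<user>[^,]+),\s*access=(?P<access>\w+),\s*'
    r'inode="(?P<inode>[^"]*)":(?P<owner>[^:]+):(?P<group>[^:]+):'
    r'(?P<mode>[-rwx]{9})'
)


def parse_denied(message):
    """Return the fields of a permission-denied message as a dict, or None."""
    # Mail clients wrap long lines and add "> " quote prefixes; undo both.
    flat = " ".join(line.lstrip("> ").strip() for line in message.splitlines())
    match = DENIED.search(flat)
    return match.groupdict() if match else None


def owner_allows(mode, access):
    """Check the owner triplet (first three characters) of an rwx mode string."""
    index = {"READ": 0, "WRITE": 1, "EXECUTE": 2}[access]
    return mode[index] != "-"


# The message as it appears in the quoted mail, wrapping included.
SAMPLE = """> Caused by: org.apache.hadoop.ipc.RemoteException:
> org.apache.hadoop.security.AccessControlException: Permission denied:
> user=mlcamus, access=EXECUTE,
> inode="recovered.edits":mlcamus:supergroup:rw-r--r--"""

if __name__ == "__main__":
    fields = parse_denied(SAMPLE)
    print(fields)
    # The requesting user is also the owner, so the owner bits are what count.
    print(owner_allows(fields["mode"], fields["access"]))
```

In HDFS the x bit only matters on directories, where it is needed to reach a child, so an EXECUTE check on recovered.edits suggests something is walking through that path as if it were a directory. That is an inference from the message, not something confirmed here; seeing the full path would settle it.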

St.Ack

On Wed, Sep 8, 2010 at 9:54 PM, Matthew LeMieux <[email protected]> wrote:
> Well, it was short-lived. It only stayed up for a couple of hours, and all 
> region servers crashed this time, not just one.
>
> Now, after restarting, I've got the master server complaining about not 
> having executable permissions on "recovered.edits".  Where is this file?
>
>  Caused by: org.apache.hadoop.ipc.RemoteException: 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=mlcamus, access=EXECUTE, 
> inode="recovered.edits":mlcamus:supergroup:rw-r--r--
>
> The message has repeated for a half hour, with this showing up in one region 
> server:
>
> 2010-09-09 04:52:34,887 DEBUG 
> org.apache.hadoop.hbase.regionserver.HRegionServer: 
> NotServingRegionException; -ROOT-,,0
>
> I assume this will get better if I change permissions of some file... which 
> one?
>
> -Matthew
>
>
> On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:
>
>> I tried moving that file to tmp.  It appears as though the master is no 
>> longer stuck, but clients are still not able to run queries.
>>
>> There aren't any messages passing by in the log files (just routine messages 
>> I see when the server isn't doing anything), but attempts to run queries 
>> (e.g., count 'table') resulted in NotServingRegionException errors.
>>
>> I tried enable 'table', and found that after this command there was a huge 
>> amount of activity in the log files, and I was able to run queries again.
>>
>> There was no previous call to disable 'table', but for some reason HBase 
>> wasn't bringing tables/regions online.
>>
>> I'm not sure what caused the problem or even if the actions I took will fix 
>> it again in the future, but I am back up and running for now.
>>
>> FYI,
>>
>> -Matthew
>>
>> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
>>
>>> My HBase cluster just crashed.  One of the region servers stopped (I do not 
>>> yet know why).  After restarting it, the cluster seemed a bit wobbly, so I 
>>> decided to shut down everything and restart fresh.  I did so (including 
>>> ZooKeeper and HDFS).
>>>
>>> Upon restart, I'm getting the following message in the Master's log file 
>>> repeating continuously with the number of ms waited counting up.
>>>
>>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 
>>> 69188ms for lease recovery on 
>>> hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>>>  failed to create file 
>>> /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298
>>>  for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because 
>>> current leaseholder is trying to recreate file.
>>>       at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>>>       at 
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>>       at 
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>>       at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>       at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>>
>>>
>>> The region servers are waiting with this being the final message in their 
>>> log file:
>>>
>>> 2010-09-09 00:53:49,111 INFO 
>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
>>> 10.104.37.247:60000 that we are up
>>>
>>> I've been using this version for a little under a week without incident 
>>> (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).
>>>
>>> The HDFS comes from CDH3.
>>>
>>> Does anybody have any ideas on what I can do to get back up and running?
>>>
>>> Thank you,
>>>
>>> Matthew
>>>
>>
>
>
