Replies below

On Sep 8, 2010, at 10:00 PM, Stack wrote:

> recovered.edits is the name of the file produced when wal logs are
> split; one is made per region
> 
> Where are you seeing that message?  Does it not have the full path to the
> recovered.edits file?
> 

In the master log file.  Full path was not there. 

> You are running w/ perms enabled on this cluster?
> 

It was enabled and it has now been turned off.  Will that fix the problem of a 
file not being executable?  In any case that problem is intermittent.  It 
usually shows up only after a partial restart (i.e. a Region server goes down 
and I restart it), but does not show up after a complete restart of the whole 
cluster. 
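
If it recurs with permissions enabled, the usual remedy is to fix the mode of the offending file directly in HDFS. A sketch, using a hypothetical region path (the real path never appeared in the master log, so substitute the one from your own error message):

```shell
# Hypothetical path -- replace with the actual one from the log.
# recovered.edits must be readable (and, if it is a directory,
# traversable) by the user the HBase daemons run as.
hadoop fs -ls /hbase/mytable/1234567890/recovered.edits
hadoop fs -chmod -R 755 /hbase/mytable/1234567890/recovered.edits
```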

> Why did the regionservers go down?

I tracked the reason for the most recent "crash" down to "too many open files" 
for the user that runs hadoop.  Very odd situation: both the user running hbase 
and the hadoop user were in the /etc/security/limits.conf file with a limit of 
50000, but the change only took effect for one of them.  'ulimit -n' reported 
1024 for the hadoop account and 50000 for the hbase user's account.  I did 
three things before rebooting the machine; I'm not sure which of them were 
needed to fix it: 
    *  Added "session required        pam_limits.so" to 
/etc/pam.d/common-session (pam_limits.so was already being referenced in 
several other files in /etc/pam.d, but was missing from this one)
    *  Gave hadoop a home directory that exists (by editing the /etc/passwd 
file)
    *  Added "*                hard    nofile          50000" to the 
/etc/security/limits.conf file (in addition to the two lines for each user 
that were already there)
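
For reference, after those changes the relevant pieces looked roughly like the following. (The "hbase" account name here is my guess at a typical setup; use whichever users actually run your daemons.)

```
# /etc/security/limits.conf
*        hard    nofile    50000
hadoop   soft    nofile    50000
hadoop   hard    nofile    50000
hbase    soft    nofile    50000
hbase    hard    nofile    50000

# /etc/pam.d/common-session
session required        pam_limits.so
```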

(on Ubuntu Karmic, running CDH version: 0.20.2+320-1~karmic-cdh3b2)
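
A quick way to confirm a limit actually took effect is to check it from a fresh login session for each service account (a sketch; 1024 is the Ubuntu default, so seeing it means limits.conf was not applied to that session):

```shell
#!/bin/sh
# Print the open-file limit in effect for the current session.
echo "open files limit for $(id -un): $(ulimit -n)"
# As root, the same check can be run for another account, e.g.:
#   su - hadoop -c 'ulimit -n'
```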

The CDH distribution doesn't appear to have the hadoop home directory situation 
figured out (they put it in a directory that gets deleted on reboots).  I 
change it routinely, but apparently missed this machine.  
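
The fix I apply is to point the account at a directory that survives reboots, which can be done with usermod instead of hand-editing /etc/passwd (a sketch; run as root, and the path is just an example):

```shell
# Create a persistent home directory and point the hadoop account at it.
mkdir -p /home/hadoop
usermod -d /home/hadoop hadoop
chown hadoop: /home/hadoop
```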

This is likely to fix quite a few problems, but I think there is still a 
mystery to be solved.  I'll have to wait until it happens again to get a clean 
log of the event. 

FYI,

Matthew


> On Wed, Sep 8, 2010 at 9:54 PM, Matthew LeMieux <[email protected]> wrote:
>> Well, it was short-lived: it only stayed up for a couple of hours, and all 
>> region servers crashed this time, not just one.
>> 
>> Now, after restarting, I've got the master server complaining about not 
>> having executable permissions on "recovered.edits".  Where is this file?
>> 
>>  Caused by: org.apache.hadoop.ipc.RemoteException: 
>> org.apache.hadoop.security.AccessControlException: Permission denied: 
>> user=mlcamus, access=EXECUTE, 
>> inode="recovered.edits":mlcamus:supergroup:rw-r--r--
>> 
>> The message has repeated for a half hour, with this showing up in one region 
>> server:
>> 
>> 2010-09-09 04:52:34,887 DEBUG 
>> org.apache.hadoop.hbase.regionserver.HRegionServer: 
>> NotServingRegionException; -ROOT-,,0
>> 
>> I assume this will get better if I change permissions of some file... which 
>> one?
>> 
>> -Matthew
>> 
>> 
>> On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:
>> 
>>> I tried moving that file to tmp.  It appears as though the master is no 
>>> longer stuck, but clients are still not able to run queries.
>>> 
>>> There aren't any messages passing by in the log files (just the routine 
>>> messages I see when the server isn't doing anything), but attempts to run 
>>> queries resulted in NotServingRegionExceptions (e.g., count 'table').
>>> 
>>> I tried enable 'table', and found that after this command there was a huge 
>>> amount of activity in the log files, and I was able to run queries again.
>>> 
>>> There was no previous call to disable 'table', but for some reason HBase 
>>> wasn't bringing tables/regions online.
>>> 
>>> I'm not sure what caused the problem or even if the actions I took will fix 
>>> it again in the future, but I am back up and running for now.
>>> 
>>> FYI,
>>> 
>>> -Matthew
>>> 
>>> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
>>> 
>>>> My HBase cluster just crashed.   One of the Region servers stopped (I do 
>>>> not yet know why).  After restarting it, the cluster seemed a bit wobbly, 
>>>> so I decided to shut down everything and restart fresh.  I did so (including 
>>>> zookeeper and HDFS).
>>>> 
>>>> Upon restart, I'm getting the following message in the Master's log file 
>>>> repeating continuously with the number of ms waited counting up.
>>>> 
>>>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 
>>>> 69188ms for lease recovery on 
>>>> hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
>>>>  failed to create file 
>>>> /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298
>>>>  for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247 because 
>>>> current leaseholder is trying to recreate file.
>>>>       at 
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
>>>>       at 
>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
>>>>       at 
>>>> org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
>>>>       at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
>>>>       at 
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>       at java.lang.reflect.Method.invoke(Method.java:597)
>>>>       at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
>>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
>>>>       at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
>>>>       at java.security.AccessController.doPrivileged(Native Method)
>>>>       at javax.security.auth.Subject.doAs(Subject.java:396)
>>>>       at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
>>>> 
>>>> 
>>>> The region servers are waiting with this being the final message in their 
>>>> log file:
>>>> 
>>>> 2010-09-09 00:53:49,111 INFO 
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at 
>>>> 10.104.37.247:60000 that we are up
>>>> 
>>>> I've been using this version for a little under a week without incident 
>>>> (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/ ).
>>>> 
>>>> The HDFS comes from CDH3.
>>>> 
>>>> Does anybody have any ideas on what I can do to get back up and running?
>>>> 
>>>> Thank you,
>>>> 
>>>> Matthew
>>>> 
>>> 
>> 
>> 
