I think the root issue you ran into is HBASE-2975, which I coincidentally also found last night. The fix is committed and should be in our next rc/release.
Thanks
-Todd

On Thu, Sep 9, 2010 at 10:24 AM, Matthew LeMieux <[email protected]> wrote:
> Replies below
>
> On Sep 8, 2010, at 10:00 PM, Stack wrote:
>
> > recovered.edits is the name of the file produced when wal logs are
> > split; one is made per region
> >
> > Where are you seeing that message?  Does it not have the full path to
> > the recovered.edits file?
>
> In the master log file.  The full path was not there.
>
> > You are running w/ perms enabled on this cluster?
>
> It was enabled and it has now been turned off.  Will that fix the problem
> of a file not being executable?  In any case, that problem is intermittent.
> It usually shows up only after a partial restart (i.e., a region server goes
> down and I restart it), but does not show up after a complete restart of the
> whole cluster.
>
> > Why did the regionservers go down?
>
> I tracked the reason for the most recent "crash" down to "too many open
> files" for the user that runs hadoop.  Very odd situation: both the user
> running hbase and the user running hadoop were in the
> /etc/security/limits.conf file with a limit of 50000, but the change only
> worked for one user.  hadoop's account reported 1024, and the hbase user's
> account reported 50000 to 'ulimit -n'.  I did three things before rebooting
> the machine, not sure which were needed to fix it:
>
>   * Added "session required pam_limits.so" to /etc/pam.d/common-session
>     (pam_limits.so was already referenced in several other files in
>     /etc/pam.d, but was missing from this one)
>   * Gave hadoop a home directory that exists (by editing /etc/passwd)
>   * Added "* hard nofile 50000" to /etc/security/limits.conf (in
>     addition to the two per-user lines that were already there)
>
> (on Ubuntu Karmic, running CDH version: 0.20.2+320-1~karmic-cdh3b2)
>
> The CDH distribution doesn't appear to have the hadoop home directory
> situation figured out (they put it in a directory that gets deleted on
> reboot).
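[Editor's note: the nofile mismatch described above can be sanity-checked by parsing the limits file directly. A minimal sketch, assuming a limits.conf-style file (domain, type, item, value columns); the user names, helper name, and sample file path are illustrative, not from this thread.]

```shell
# Hypothetical helper: report the last matching nofile limit for a user
# from an /etc/security/limits.conf-style file (domain type item value).
limits_for_user() {
  user="$1"; file="$2"
  awk -v u="$user" '($1 == u || $1 == "*") && $3 == "nofile" { v = $4 } END { print v }' "$file"
}

# Sample file mirroring the per-user lines plus the wildcard line added above
cat > /tmp/limits.sample <<'EOF'
hadoop hard nofile 50000
hbase  hard nofile 50000
*      hard nofile 50000
EOF

limits_for_user hadoop /tmp/limits.sample   # prints 50000
```

Note that even with correct limits.conf entries, the limit only takes effect if pam_limits.so runs for the session that launches the daemon, which is why adding it to /etc/pam.d/common-session mattered here.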
> I change it routinely, but apparently missed this machine.
>
> This is likely to fix quite a few problems, but I think there is still a
> mystery to be solved.  I'll have to wait until it happens again to get a
> clean log of the event.
>
> FYI,
>
> Matthew
>
> > On Wed, Sep 8, 2010 at 9:54 PM, Matthew LeMieux <[email protected]> wrote:
> >> Well, it was short lived; it only stayed up for a couple of hours.  All
> >> region servers crashed this time, not just one.
> >>
> >> Now, after restarting, I've got the master server complaining about not
> >> having executable permissions on "recovered.edits".  Where is this file?
> >>
> >> Caused by: org.apache.hadoop.ipc.RemoteException:
> >> org.apache.hadoop.security.AccessControlException: Permission denied:
> >> user=mlcamus, access=EXECUTE,
> >> inode="recovered.edits":mlcamus:supergroup:rw-r--r--
> >>
> >> The message has repeated for a half hour, with this showing up in one
> >> region server:
> >>
> >> 2010-09-09 04:52:34,887 DEBUG
> >> org.apache.hadoop.hbase.regionserver.HRegionServer:
> >> NotServingRegionException; -ROOT-,,0
> >>
> >> I assume this will get better if I change permissions of some file...
> >> which one?
> >>
> >> -Matthew
> >>
> >> On Sep 8, 2010, at 6:21 PM, Matthew LeMieux wrote:
> >>
> >>> I tried moving that file to tmp.  It appears as though the master is
> >>> no longer stuck, but clients are still not able to run queries.
> >>>
> >>> There aren't any messages passing by in the log files (just the
> >>> routine messages I see when the server isn't doing anything), but
> >>> attempts to run queries (e.g., count 'table') resulted in
> >>> NotServingRegionException errors.
> >>>
> >>> I tried enable 'table', and found that after this command there was a
> >>> huge amount of activity in the log files, and I was able to run
> >>> queries again.
> >>>
> >>> There was no previous call to disable 'table', but for some reason
> >>> HBase wasn't bringing tables/regions online.
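[Editor's note: the access=EXECUTE denial above on an inode with mode rw-r--r-- (644) is the classic missing directory execute (traverse) bit. A local-filesystem sketch of the same rule follows; the paths are made up for illustration, and on HDFS the equivalent fix would be a `hadoop fs -chmod` on the offending path.]

```shell
# Illustrative only: a directory with mode 644 blocks access to its
# contents; adding the execute bit (755) allows traversal. Run as non-root.
d=$(mktemp -d)
mkdir "$d/recovered.edits"
echo "edit-data" > "$d/recovered.edits/0000000001"

chmod 644 "$d/recovered.edits"      # rw-r--r--, like the inode in the error
cat "$d/recovered.edits/0000000001" 2>/dev/null || echo "traverse denied"

chmod 755 "$d/recovered.edits"      # add the execute (traverse) bit
cat "$d/recovered.edits/0000000001" && echo "traverse ok"
```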
> >>>
> >>> I'm not sure what caused the problem, or even if the actions I took
> >>> will fix it again in the future, but I am back up and running for now.
> >>>
> >>> FYI,
> >>>
> >>> -Matthew
> >>>
> >>> On Sep 8, 2010, at 6:00 PM, Matthew LeMieux wrote:
> >>>
> >>>> My HBase cluster just crashed.  One of the region servers stopped (I
> >>>> do not yet know why).  After restarting it, the cluster seemed a bit
> >>>> wobbly, so I decided to shut down everything and restart fresh.  I
> >>>> did so (including zookeeper and HDFS).
> >>>>
> >>>> Upon restart, I'm getting the following message in the master's log
> >>>> file, repeating continuously with the number of ms waited counting up.
> >>>>
> >>>> 2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils:
> >>>> Waited 69188ms for lease recovery on
> >>>> hdfs://domU-12-31-39-18-12-05.compute-1.internal:9000/hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298:
> >>>> org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
> >>>> failed to create file
> >>>> /hbase/.logs/domU-12-31-39-0C-38-31.compute-1.internal,60020,1283905848540/10.215.59.191%3A60020.1283905909298
> >>>> for DFSClient_hb_m_10.104.37.247:60000 on client 10.104.37.247
> >>>> because current leaseholder is trying to recreate file.
> >>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInternal(FSNamesystem.java:1068)
> >>>>         at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:1181)
> >>>>         at org.apache.hadoop.hdfs.server.namenode.NameNode.append(NameNode.java:422)
> >>>>         at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
> >>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >>>>         at java.lang.reflect.Method.invoke(Method.java:597)
> >>>>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
> >>>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
> >>>>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
> >>>>         at java.security.AccessController.doPrivileged(Native Method)
> >>>>         at javax.security.auth.Subject.doAs(Subject.java:396)
> >>>>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)
> >>>>
> >>>> The region servers are waiting with this being the final message in
> >>>> their log file:
> >>>>
> >>>> 2010-09-09 00:53:49,111 INFO
> >>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Telling master at
> >>>> 10.104.37.247:60000 that we are up
> >>>>
> >>>> I've been using this version for a little under a week without
> >>>> incident
> >>>> (http://people.apache.org/~jdcryans/hbase-0.89.20100830-candidate-1/).
> >>>>
> >>>> The HDFS comes from CDH3.
> >>>>
> >>>> Does anybody have any ideas on what I can do to get back up and
> >>>> running?
> >>>>
> >>>> Thank you,
> >>>>
> >>>> Matthew

--
Todd Lipcon
Software Engineer, Cloudera
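[Editor's note: when the master loops on FSUtils lease-recovery warnings like the one quoted above, a useful first step is extracting which WAL file it is stuck on. A small sketch, assuming the log line format shown in this thread; the sample file path and shortened hostnames are illustrative.]

```shell
# Pull the stuck WAL path out of FSUtils "Waited ...ms for lease recovery" lines.
cat > /tmp/master.log.sample <<'EOF'
2010-09-09 00:54:58,406 WARN org.apache.hadoop.hbase.util.FSUtils: Waited 69188ms for lease recovery on hdfs://nn.example:9000/hbase/.logs/rs1.example,60020,1283905848540/10.215.59.191%3A60020.1283905909298: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: ...
EOF

# Capture everything from "hdfs:" up to the colon before the exception name.
sed -n 's/.*for lease recovery on \(hdfs:[^ ]*\):.*/\1/p' /tmp/master.log.sample
```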
