Hey Stack.  Here's the region server log from this morning's crash:
http://pastebin.com/b7cEUT3U

Not much happening there.  I also found the log from last night's crash,
which appears to be more interesting: http://pastebin.com/8VqpUYSV

It looks like the region server was having some problems doing ICVs
(incrementColumnValue calls), and there was this weird error:

2010-10-04 00:25:59,876 WARN org.apache.hadoop.hbase.regionserver.Store:
Failed open of
hdfs://rts-nn01.sldc.[domain].net:50001/hbase/users/1958649137/data/5261945116444723281;
presumption is that file was corrupted at flush and lost edits picked up by
commit log replay. Verify!
java.io.IOException: Trailer 'header' is wrong; does the trailer size match
content?
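In case it helps with the "Verify!" that warning asks for, here's a rough
sketch of pulling every mention of the suspect store file out of the
regionserver log.  The log file below is just a one-line stand-in built from
the WARN above so the commands run as-is; point the grep at the real log:

```shell
# Stand-in log file containing the WARN line quoted above; in practice
# you'd grep the actual regionserver log instead.
cat > rs.log <<'EOF'
2010-10-04 00:25:59,876 WARN org.apache.hadoop.hbase.regionserver.Store: Failed open of hdfs://rts-nn01.sldc.[domain].net:50001/hbase/users/1958649137/data/5261945116444723281; presumption is that file was corrupted at flush and lost edits picked up by commit log replay. Verify!
EOF

# Pull every line mentioning the suspect store file ID, with timestamps,
# to see what happened to it around the flush and the log replay.
grep '5261945116444723281' rs.log
```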

You can see that I had to kill the RS and restart it near the end of the
snippet.  I wonder if this problem has anything to do with IHBase because an
index scan was running around the time of the crash.  Everything was stable
before our release about a week ago, which included introducing IHBase.  We
also added a couple new region servers and a new client app, so that wasn't
the only change.  Still, I think I might try removing IHBase temporarily to
see if that improves things.
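Also, for anyone else digging through the jstacks quoted below, here's a
quick way to summarize a dump by thread state.  The file name and the two
sample thread headers are made up purely for illustration; substitute a real
dump like the pastebins in this thread:

```shell
# Toy two-thread jstack excerpt so the pipeline below runs as-is.
cat > rs-jstack.txt <<'EOF'
"IPC Server handler 0 on 60020" daemon prio=10 tid=0x1 nid=0x1 waiting on condition
   java.lang.Thread.State: WAITING (parking)
"IPC Server handler 1 on 60020" daemon prio=10 tid=0x2 nid=0x2 runnable
   java.lang.Thread.State: BLOCKED (on object monitor)
EOF

# Count threads per state: many WAITING handlers usually means idle
# (e.g. parked in take()), many BLOCKED means contention on a shared lock.
grep -o 'java.lang.Thread.State: [A-Z_]*' rs-jstack.txt | sort | uniq -c
```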

-James


On Mon, Oct 4, 2010 at 1:26 PM, Stack <[email protected]> wrote:

> And a log snippet from the regionserver at that time would help James...
> thanks.
> St.Ack
>
> On Mon, Oct 4, 2010 at 8:53 AM, James Baldassari <[email protected]>
> wrote:
> > It happened again this morning, and this time I have full jstacks.  I
> > didn't
> > realize jstack had to be run as the same user that owns the process.
> >
> > Here's one of the region servers: http://pastebin.com/VeWXDQcu
> > And the master: http://pastebin.com/pk1eAszJ
> >
> > These seem to indicate that most threads are waiting on take(), which I
> > guess means they're idle waiting for requests to come in?  That sounds
> > strange to me because I know the clients are trying to send requests.
> >
> > -James
> >
> >
> > On Mon, Oct 4, 2010 at 10:18 AM, James Baldassari <[email protected]>
> > wrote:
> >
> >> Thanks for the tip, Ryan.  The cluster got into that weird state
> >> again last night, and I tried to jstack everything.  I did have some
> >> trouble, though.  It only worked with the -F flag, and even then I
> >> couldn't get any stack traces.  According to the docs, the fact that
> >> I needed to use -F means that the JVM was hung for some reason.  I'm
> >> not really sure what could cause that.  Like I mentioned before, I
> >> don't see any long GC pauses in the logs.
> >>
> >> Here is the jstack output I was able to get for one of the region
> >> servers: http://pastebin.com/A9W1ti5S
> >> And the master: http://pastebin.com/jb2cvmFC
> >>
> >> Both indicate that all the threads are blocked except one.  I also got a
> >> thread dump on a couple of the region servers.  Here's one:
> >> http://pastebin.com/KkWcY5mf
> >>
> >> It looks like most of the threads are blocked in
> >> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get or
> >> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.release.  Is that
> >> normal?
> >>
> >> Thanks,
> >> James
> >>
> >>
> >>
> >> On Sun, Oct 3, 2010 at 11:55 PM, Ryan Rawson <[email protected]>
> >> wrote:
> >>
> >>> During the event try jstack'ing the affected regionservers. That is
> >>> usually extremely illuminating.
> >>> On Oct 3, 2010 8:06 PM, "James Baldassari" <[email protected]>
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > We've been having a strange problem with our HBase cluster recently
> >>> > (0.20.5 + HBASE-2599 + IHBase-0.20.5). Everything will be working
> >>> > fine, doing mostly gets at 5-10k/sec and an hourly bulk insert
> >>> > (using HTable puts) that can spike the total throughput up to
> >>> > 15-50k ops/sec, but at some point the cluster gets into this state
> >>> > where the request throughput (gets and puts) drops to zero across 5
> >>> > of our 6 region servers. Restarting the whole cluster is the only
> >>> > way to fix the problem, but it gets back into that bad state again
> >>> > after 4-12 hours.
> >>> >
> >>> > Nothing in the region server or master logs indicates any errors
> >>> > except occasional DFS client timeouts. The logs look exactly like
> >>> > they do during normal operation, even with debug logging on. I have
> >>> > GC logging on as well, and there are no long GC pauses (the region
> >>> > servers have 11G of heap). When the request rate drops the load is
> >>> > low on the region servers, there is little to no I/O wait, and
> >>> > there are no messages in the region server logs indicating that the
> >>> > region servers are busy doing anything like a compaction. It seems
> >>> > like the region servers just decided to stop processing requests.
> >>> > We have three different client applications sending requests to
> >>> > HBase, and they all drop to zero requests/second at the same time,
> >>> > so I don't think it's an issue on the client side. There are no
> >>> > errors in our client logs either.
> >>> >
> >>> > Our hbase-site.xml is here: http://pastebin.com/cJ4cnH5W
> >>> >
> >>> > Any ideas what could be causing the cluster to freeze up? I guess
> >>> > my next plan is to get thread dumps on the region servers and the
> >>> > clients the next time it happens. Is there somewhere else I should
> >>> > look other than the master and region server logs?
> >>> >
> >>> > Thanks,
> >>> > James
> >>>
> >>
> >>
> >
>
