And a log snippet from the regionserver at that time would help, James... thanks.
St.Ack

On Mon, Oct 4, 2010 at 8:53 AM, James Baldassari <[email protected]> wrote:
> It happened again this morning, and this time I have full jstacks.  I didn't
> realize jstack had to be run as the same user that owns the process.
>
> Here's one of the region servers: http://pastebin.com/VeWXDQcu
> And the master: http://pastebin.com/pk1eAszJ
>
> These seem to indicate that most threads are waiting on take(), which I
> guess means they're idle waiting for requests to come in?  That sounds
> strange to me because I know the clients are trying to send requests.
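>
> For reference, here's a rough sketch of what I understand those take()
> frames to be -- handler threads parked on a call queue until a request
> arrives.  This is just my approximation of the pattern, not the actual
> HBase IPC source:
>
>     import java.util.concurrent.BlockingQueue;
>     import java.util.concurrent.LinkedBlockingQueue;
>
>     // Hypothetical handler loop: jstack shows take() while the queue is
>     // empty, i.e. the handler is idle rather than stuck.
>     public class HandlerSketch {
>         private final BlockingQueue<Runnable> callQueue =
>                 new LinkedBlockingQueue<Runnable>();
>
>         class Handler extends Thread {
>             @Override
>             public void run() {
>                 while (!isInterrupted()) {
>                     try {
>                         Runnable call = callQueue.take(); // parks while idle
>                         call.run();                       // process the request
>                     } catch (InterruptedException e) {
>                         return;
>                     }
>                 }
>             }
>         }
>     }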
>
> -James
>
>
> On Mon, Oct 4, 2010 at 10:18 AM, James Baldassari <[email protected]> wrote:
>
>> Thanks for the tip, Ryan.  The cluster got into that weird state again last
>> night, and I tried to jstack everything.  I did have some trouble, though.
>> It only worked with the -F flag, and even then I couldn't get any stack
>> traces.  According to the docs, the fact that I needed to use -F means that
>> the JVM was hung for some reason.  I'm not really sure what could cause
>> that.  Like I mentioned before, I don't see any long GC pauses in the logs.
>>
>> Here is the jstack output I was able to get for one of the region servers:
>> http://pastebin.com/A9W1ti5S
>> And the master: http://pastebin.com/jb2cvmFC
>>
>> Both indicate that all the threads are blocked except one.  I also got a
>> thread dump on a couple of the region servers.  Here's one:
>> http://pastebin.com/KkWcY5mf
>>
>> It looks like most of the threads are blocked in
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get or
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.release.  Is that
>> normal?
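>>
>> In case it helps interpret the dumps: my rough understanding (a
>> simplified sketch, not Hadoop's actual SelectorPool code) is that the
>> pool hands out java.nio Selectors for timed socket I/O, and get() and
>> release() are synchronized, so threads show up BLOCKED on the pool's
>> monitor whenever many handlers are doing DFS socket I/O at once:
>>
>>     import java.io.IOException;
>>     import java.nio.channels.Selector;
>>     import java.util.ArrayDeque;
>>     import java.util.Deque;
>>
>>     // Simplified selector pool: callers briefly block on the monitor
>>     // while checking a Selector in or out, which is roughly what the
>>     // BLOCKED frames at SelectorPool.get/release look like in a dump.
>>     public class SelectorPoolSketch {
>>         private final Deque<Selector> idle = new ArrayDeque<Selector>();
>>
>>         public synchronized Selector get() throws IOException {
>>             Selector s = idle.pollFirst();
>>             return (s != null) ? s : Selector.open(); // reuse or create
>>         }
>>
>>         public synchronized void release(Selector s) {
>>             idle.addFirst(s); // return the selector for later reuse
>>         }
>>     }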
>>
>> Thanks,
>> James
>>
>>
>>
>> On Sun, Oct 3, 2010 at 11:55 PM, Ryan Rawson <[email protected]> wrote:
>>
>>> During the event, try jstack'ing the affected regionservers. That is
>>> usually extremely illuminating.
>>> On Oct 3, 2010 8:06 PM, "James Baldassari" <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > We've been having a strange problem with our HBase cluster recently
>>> > (0.20.5 + HBASE-2599 + IHBase-0.20.5). Everything will be working
>>> > fine, doing mostly gets at 5-10k/sec and an hourly bulk insert (using
>>> > HTable puts) that can spike the total throughput up to 15-50k
>>> > ops/sec, but at some point the cluster gets into this state where the
>>> > request throughput (gets and puts) drops to zero across 5 of our 6
>>> > region servers. Restarting the whole cluster is the only way to fix
>>> > the problem, but it gets back into that bad state again after 4-12
>>> > hours.
>>> >
>>> > Nothing in the region server or master logs indicates any errors
>>> > except occasional DFS client timeouts. The logs look exactly like
>>> > they do during normal operation, even with debug logging on. I have
>>> > GC logging on as well, and there are no long GC pauses (the region
>>> > servers have 11G of heap). When the request rate drops, the load is
>>> > low on the region servers, there is little to no I/O wait, and there
>>> > are no messages in the region server logs indicating that the region
>>> > servers are busy doing anything like a compaction. It seems like the
>>> > region servers just decided to stop processing requests. We have
>>> > three different client applications sending requests to HBase, and
>>> > they all drop to zero requests/second at the same time, so I don't
>>> > think it's an issue on the client side. There are no errors in our
>>> > client logs either.
>>> >
>>> > Our hbase-site.xml is here: http://pastebin.com/cJ4cnH5W
>>> >
>>> > Any ideas what could be causing the cluster to freeze up? I guess my
>>> > next plan is to get thread dumps on the region servers and the
>>> > clients the next time it happens. Is there somewhere else I should
>>> > look other than the master and region server logs?
>>> >
>>> > Thanks,
>>> > James
>>>
>>
>>
>
