And a log snippet from the regionserver at that time would help, James... thanks.
St.Ack
On Mon, Oct 4, 2010 at 8:53 AM, James Baldassari <[email protected]> wrote:
> It happened again this morning, and this time I have full jstacks. I didn't
> realize jstack had to be run as the same user that owns the process.
>
> Here's one of the region servers: http://pastebin.com/VeWXDQcu
> And the master: http://pastebin.com/pk1eAszJ
>
> These seem to indicate that most threads are waiting on take(), which I
> guess means they're idle waiting for requests to come in? That sounds
> strange to me because I know the clients are trying to send requests.
>
> -James
>
>
> On Mon, Oct 4, 2010 at 10:18 AM, James Baldassari <[email protected]> wrote:
>
>> Thanks for the tip, Ryan. The cluster got into that weird state again last
>> night, and I tried to jstack everything. I did have some trouble, though.
>> It only worked with the -F flag, and even then I couldn't get any stack
>> traces. According to the docs, the fact that I needed to use -F means that
>> the JVM was hung for some reason. I'm not really sure what could cause
>> that. Like I mentioned before, I don't see any long GC pauses in the logs.
>>
>> Here is the jstack output I was able to get for one of the region servers:
>> http://pastebin.com/A9W1ti5S
>> And the master: http://pastebin.com/jb2cvmFC
>>
>> Both indicate that all the threads are blocked except one. I also got a
>> thread dump on a couple of the region servers. Here's one:
>> http://pastebin.com/KkWcY5mf
>>
>> It looks like most of the threads are blocked in
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get or
>> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.release. Is that
>> normal?
>>
>> Thanks,
>> James
>>
>>
>> On Sun, Oct 3, 2010 at 11:55 PM, Ryan Rawson <[email protected]> wrote:
>>
>>> During the event try jstack'ing the affected regionservers. That is
>>> usually extremely illuminating.
>>>
>>> On Oct 3, 2010 8:06 PM, "James Baldassari" <[email protected]> wrote:
>>> > Hi,
>>> >
>>> > We've been having a strange problem with our HBase cluster recently
>>> > (0.20.5 + HBASE-2599 + IHBase-0.20.5). Everything will be working
>>> > fine, doing mostly gets at 5-10k/sec and an hourly bulk insert (using
>>> > HTable puts) that can spike the total throughput up to 15-50k ops/sec,
>>> > but at some point the cluster gets into this state where the request
>>> > throughput (gets and puts) drops to zero across 5 of our 6 region
>>> > servers. Restarting the whole cluster is the only way to fix the
>>> > problem, but it gets back into that bad state again after 4-12 hours.
>>> >
>>> > Nothing in the region server or master logs indicates any errors
>>> > except occasional DFS client timeouts. The logs look exactly like they
>>> > do during normal operation, even with debug logging on. I have GC
>>> > logging on as well, and there are no long GC pauses (the region
>>> > servers have 11G of heap). When the request rate drops the load is low
>>> > on the region servers, there is little to no I/O wait, and there are
>>> > no messages in the region server logs indicating that the region
>>> > servers are busy doing anything like a compaction. It seems like the
>>> > region servers just decided to stop processing requests. We have three
>>> > different client applications sending requests to HBase, and they all
>>> > drop to zero requests/second at the same time, so I don't think it's
>>> > an issue on the client side. There are no errors in our client logs
>>> > either.
>>> >
>>> > Our hbase-site.xml is here: http://pastebin.com/cJ4cnH5W
>>> >
>>> > Any ideas what could be causing the cluster to freeze up? I guess my
>>> > next plan is to get thread dumps on the region servers and the clients
>>> > the next time it happens. Is there somewhere else I should look other
>>> > than the master and region server logs?
>>> >
>>> > Thanks,
>>> > James
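For reference, jstack attaches through the JVM's dynamic attach mechanism, so it normally has to be run as the same user that owns the target process, and -F forces a dump via the serviceability agent when the JVM is not responding. A minimal sketch of capturing dumps from a regionserver, assuming the process runs as an "hbase" user (the user name and <pid> are placeholders for your setup):

    # list JVMs owned by the hbase user and note the HRegionServer PID
    sudo -u hbase jps -l
    # normal attach; must be run as the process owner
    sudo -u hbase jstack <pid> > rs-jstack.txt
    # forced dump for a hung or unresponsive JVM (slower, read-only attach)
    sudo -u hbase jstack -F <pid> > rs-jstack-forced.txt
    # alternative: SIGQUIT makes the JVM print a thread dump to its stdout log
    kill -QUIT <pid>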
