Hey Stack. Here's the region server log from this morning's crash: http://pastebin.com/b7cEUT3U
Not much happening there. I also found the log from last night's crash,
which appears to be more interesting: http://pastebin.com/8VqpUYSV

It looks like it's having some problems doing ICVs, and there was this
weird error:

2010-10-04 00:25:59,876 WARN org.apache.hadoop.hbase.regionserver.Store:
Failed open of
hdfs://rts-nn01.sldc.[domain].net:50001/hbase/users/1958649137/data/5261945116444723281;
presumption is that file was corrupted at flush and lost edits picked up
by commit log replay. Verify!
java.io.IOException: Trailer 'header' is wrong; does the trailer size
match content?

You can see that I had to kill the RS and restart it near the end of the
snippet. I wonder if this problem has anything to do with IHBase because
an index scan was running around the time of the crash. Everything was
stable before our release about a week ago, which included introducing
IHBase. We also added a couple new region servers and a new client app,
so that wasn't the only change. Still, I think I might try removing
IHBase temporarily to see if that improves things.

-James

On Mon, Oct 4, 2010 at 1:26 PM, Stack <[email protected]> wrote:
> And a log snippet from the regionserver at that time would help James...
> thanks.
> St.Ack
>
> On Mon, Oct 4, 2010 at 8:53 AM, James Baldassari <[email protected]> wrote:
> > It happened again this morning, and this time I have full jstacks. I
> > didn't realize jstack had to be run as the same user that owns the
> > process.
> >
> > Here's one of the region servers: http://pastebin.com/VeWXDQcu
> > And the master: http://pastebin.com/pk1eAszJ
> >
> > These seem to indicate that most threads are waiting on take(), which
> > I guess means they're idle waiting for requests to come in? That
> > sounds strange to me because I know the clients are trying to send
> > requests.
> >
> > -James
> >
> > On Mon, Oct 4, 2010 at 10:18 AM, James Baldassari <[email protected]> wrote:
> >
> >> Thanks for the tip, Ryan.
> >> The cluster got into that weird state again last night, and I tried
> >> to jstack everything. I did have some trouble, though. It only worked
> >> with the -F flag, and even then I couldn't get any stack traces.
> >> According to the docs, the fact that I needed to use -F means that
> >> the JVM was hung for some reason. I'm not really sure what could
> >> cause that. Like I mentioned before, I don't see any long GC pauses
> >> in the logs.
> >>
> >> Here is the jstack output I was able to get for one of the region servers:
> >> http://pastebin.com/A9W1ti5S
> >> And the master: http://pastebin.com/jb2cvmFC
> >>
> >> Both indicate that all the threads are blocked except one. I also got
> >> a thread dump on a couple of the region servers. Here's one:
> >> http://pastebin.com/KkWcY5mf
> >>
> >> It looks like most of the threads are blocked in
> >> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.get or
> >> org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.release. Is
> >> that normal?
> >>
> >> Thanks,
> >> James
> >>
> >>
> >> On Sun, Oct 3, 2010 at 11:55 PM, Ryan Rawson <[email protected]> wrote:
> >>
> >>> During the event try jstack'ing the affected regionservers. That is
> >>> usually extremely illuminating.
> >>> On Oct 3, 2010 8:06 PM, "James Baldassari" <[email protected]> wrote:
> >>> > Hi,
> >>> >
> >>> > We've been having a strange problem with our HBase cluster
> >>> > recently (0.20.5 + HBASE-2599 + IHBase-0.20.5). Everything will be
> >>> > working fine, doing mostly gets at 5-10k/sec and an hourly bulk
> >>> > insert (using HTable puts) that can spike the total throughput up
> >>> > to 15-50k ops/sec, but at some point the cluster gets into this
> >>> > state where the request throughput (gets and puts) drops to zero
> >>> > across 5 of our 6 region servers.
> >>> > Restarting the whole cluster is the only way to fix the problem,
> >>> > but it gets back into that bad state again after 4-12 hours.
> >>> >
> >>> > Nothing in the region server or master logs indicates any errors
> >>> > except occasional DFS client timeouts. The logs look exactly like
> >>> > they do during normal operation, even with debug logging on. I
> >>> > have GC logging on as well, and there are no long GC pauses (the
> >>> > region servers have 11G of heap). When the request rate drops the
> >>> > load is low on the region servers, there is little to no I/O wait,
> >>> > and there are no messages in the region server logs indicating
> >>> > that the region servers are busy doing anything like a compaction.
> >>> > It seems like the region servers just decided to stop processing
> >>> > requests. We have three different client applications sending
> >>> > requests to HBase, and they all drop to zero requests/second at
> >>> > the same time, so I don't think it's an issue on the client side.
> >>> > There are no errors in our client logs either.
> >>> >
> >>> > Our hbase-site.xml is here: http://pastebin.com/cJ4cnH5W
> >>> >
> >>> > Any ideas what could be causing the cluster to freeze up? I guess
> >>> > my next plan is to get thread dumps on the region servers and the
> >>> > clients the next time it happens. Is there somewhere else I should
> >>> > look other than the master and region server logs?
> >>> >
> >>> > Thanks,
> >>> > James
> >>>
> >>
> >
>
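[A side note on the jstack permission issue raised in this thread: jstack can only attach to a JVM owned by the invoking user, and -F switches to a forced attach for JVMs that don't respond to the normal attach mechanism. A minimal capture-script sketch follows; the pid argument is illustrative, sudo rights to the process owner are assumed, and the actual jstack/kill invocations are left commented out since they need a JDK and a live Java target.]

```shell
#!/bin/sh
# Sketch of a stack-capture helper for a hung region server JVM.
# Assumption: $1 is the target pid; it defaults to this shell's pid
# purely so the owner lookup below has something to run against.
PID=${1:-$$}

# jstack only attaches when run as the owner of the target process,
# so look the owner up from the process table instead of guessing.
OWNER=$(ps -o user= -p "$PID" | tr -d ' ')
echo "pid $PID is owned by $OWNER"

# Try a normal jstack first; fall back to -F (forced attach) when the
# JVM is hung and won't answer. Commented out: needs a JDK + live target.
# sudo -u "$OWNER" jstack "$PID" > "jstack-$PID.txt" 2>&1 \
#   || sudo -u "$OWNER" jstack -F "$PID" > "jstack-$PID.txt" 2>&1

# Alternative: kill -3 (SIGQUIT) makes the JVM dump all threads to its
# own stdout log, which often works even when jstack cannot attach.
# sudo -u "$OWNER" kill -3 "$PID"
```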
