Looks like the time was spent in HRegion.internalFlushcache(). There is not much information in the server log.
I suggest you try out the patch in HBASE-6466 (Enable multi-thread for memstore flush). It has been running in Taobao's production system.

BTW, have you noticed the following in the log?

2012-08-10 17:30:17,927 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Too many consecutive RollWriter requests, it's a sign of the total number of live datanodes is lower than the tolerable replicas.

How many datanodes did you use?

Thanks

On Fri, Aug 10, 2012 at 3:12 PM, Espinoza,Carlos <[email protected]> wrote:
> We were able to replicate this behavior in a pseudo-distributed HBase
> (hbase-0.94.1) environment. We wrote a test program that creates a test
> table "MyTestTable" and populates it with random rows, then creates a
> row with 60,000 columns and repeatedly updates it. Each column has an
> 18-byte qualifier and a 50-byte value. In our tests, when we ran the
> program, we usually never got beyond 15 updates before it would flush
> for a really long time. The rows being updated are about 4 MB each
> (minus any HBase metadata).
>
> It doesn't seem to be caused by GC. I turned on GC logging and didn't
> see any long pauses. This is the GC log during the flush:
> http://pastebin.com/vJKKXDx5
>
> This is the region server log with debug on during the same flush:
> http://pastebin.com/Fh5213mg
>
> This is the test program we wrote:
> http://pastebin.com/aZ0k5tx2
>
> You should be able to just compile it and run it against a running
> HBase cluster.
> $ java TestTable
>
> Carlos
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of
> Jean-Daniel Cryans
> Sent: Monday, August 06, 2012 8:04 PM
> To: [email protected]
> Subject: Re: Extremely long flush times
>
> Hi Ron,
>
> I think this is caused by GC; for example, there are 7 minutes between
> the last blocking message and the end of the flush:
>
> 2012-08-06 11:44:36,348 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC
> Server handler 42 on 9009' on region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6.:
> memstore size 271.9m is >= than blocking 256.0m size
> 2012-08-06 11:51:28,139 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
> ~155.4m/162905016, currentsize=116.5m/122172880 for region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> 501337ms, sequenceid=1792264092, compaction requested=true
>
> During that time I would expect that clients got socket timeouts, tried
> to write again, and data would just pile up in the RPC queue.
>
> Do you have the GC log handy? Also, did you set the ZooKeeper session
> timeout higher than the default of 3 minutes?
>
> Thx,
>
> J-D
>
> On Mon, Aug 6, 2012 at 2:37 PM, Buckley,Ron <[email protected]> wrote:
> > We ran into this situation today on our production instance; I'm
> > wondering if anyone has seen or knows a way around it.
> >
> > We were repeatedly updating the same 60,000-column-wide row. With
> > each update, we were updating nearly every column of the row. Each
> > update generated nearly 9 MB in the WAL.
> >
> > This worked OK most of the time.
> > We saw frequent messages like this in the region server log,
> > indicating a 2.6-second flush time:
> >
> > 2012-08-06 11:42:52,316 INFO
> > org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> > of ~155.4m/162931400, currentsize=2.7k/2760 for region
> > Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> > 2610ms, sequenceid=1792263953, compaction requested=false
> >
> > However, sometimes we would see flush times of around 8-10 minutes:
> >
> > 2012-08-06 11:51:28,139 INFO
> > org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush
> > of ~155.4m/162905016, currentsize=116.5m/122172880 for region
> > Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> > 501337ms, sequenceid=1792264092, compaction requested=true
> >
> > During these long flushes, we would end up taking all the IPC handler
> > slots into that region server that we had configured (50). It looked
> > to our application like the region server was unavailable.
> >
> > After the long delay, the region server would continue running fine,
> > but then do it again a few minutes later.
> >
> > I thought it was most likely something with garbage collection, so I
> > got a stack trace during one of the events. The trace doesn't look
> > like GC was running.
> >
> > http://pastebin.com/ANBiUzX4 - Stack trace when the region server was
> > unreachable
> >
> > http://pastebin.com/naTYQxZR - Region server log during the same
> > interval
> >
> > Versions:
> >
> > HBase 0.92.1, r1298924
> >
> > Hadoop 0.20.2-cdh3u3
> >
> > ----------------------------------------------------------------------
> >
> > Ron Buckley
> >
> > x6365
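The figures quoted in the thread can be sanity-checked with some quick arithmetic: Carlos's 60,000-column row works out to roughly the 4 MB he reports, and the byte counts and durations in Ron's two log lines show the slow flush moving the same amount of data nearly 200x slower than the normal one. A small standalone check (hypothetical helper, not part of the original TestTable program):

```java
// Sanity-check the numbers reported in the thread.
public class FlushMath {
    public static void main(String[] args) {
        // Carlos's row: 60,000 columns, each an 18-byte qualifier + 50-byte value.
        long rowBytes = 60_000L * (18 + 50);
        System.out.printf("row payload ~ %.1f MB%n", rowBytes / (1024.0 * 1024.0));
        // ~3.9 MB, consistent with "about 4 MB each (minus any HBase metadata)".

        // Normal flush from Ron's log: ~162,931,400 bytes in 2,610 ms.
        double normalMBps = 162_931_400 / (1024.0 * 1024.0) / 2.610;
        // Slow flush: ~162,905,016 bytes in 501,337 ms.
        double slowMBps = 162_905_016 / (1024.0 * 1024.0) / 501.337;
        System.out.printf("normal flush ~ %.1f MB/s, slow flush ~ %.2f MB/s%n",
                normalMBps, slowMBps);
        // Roughly 60 MB/s vs. 0.3 MB/s: the slow flush writes the same
        // ~155 MB memstore about 190x slower, which points at something
        // stalling the flush path rather than at the data volume itself.
    }
}
```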
