Hi Ron, I think this is caused by GC. For example, there's 7 minutes between the last blocking message and the end of the flush:
2012-08-06 11:44:36,348 INFO org.apache.hadoop.hbase.regionserver.HRegion: Blocking updates for 'IPC Server handler 42 on 9009' on region Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6.: memstore size 271.9m is >= than blocking 256.0m size

2012-08-06 11:51:28,139 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~155.4m/162905016, currentsize=116.5m/122172880 for region Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in 501337ms, sequenceid=1792264092, compaction requested=true

During that time I would expect that clients got socket timeouts, tried to write again, and data would just pile up in the RPC queue.

Do you have the GC log handy? Also, did you set the ZooKeeper session timeout higher than the default of 3 minutes?

Thx,

J-D

On Mon, Aug 6, 2012 at 2:37 PM, Buckley,Ron <[email protected]> wrote:
> We ran into this situation today on our production instance. I'm
> wondering if anyone has seen or knows a way around it.
>
> We were repeatedly updating the same 60,000-column-wide row. With each
> update, we were updating nearly every column of the row. Each update
> generated nearly 9 MB in the WAL.
>
> This worked OK most of the time. We saw frequent messages like this in
> the region server log, indicating a 2.6-second flush time:
>
> 2012-08-06 11:42:52,316 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
> ~155.4m/162931400, currentsize=2.7k/2760 for region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> 2610ms, sequenceid=1792263953, compaction requested=false
>
> However, sometimes we would see flush times of around 8-10 minutes:
>
> 2012-08-06 11:51:28,139 INFO
> org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of
> ~155.4m/162905016, currentsize=116.5m/122172880 for region
> Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in
> 501337ms, sequenceid=1792264092, compaction requested=true
>
> During these long flushes, we would end up taking all the IPC handler
> slots into that region server that we had configured (50). It looked to
> our application like the region server was unavailable.
>
> After the long delay, the region server would continue running fine, but
> then do it again a few minutes later.
>
> I thought it was most likely something with garbage collection, so I got
> a stack trace during one of the events. The trace doesn't look like GC
> was running.
>
> http://pastebin.com/ANBiUzX4 - Stack trace when the region server was
> unreachable
> http://pastebin.com/naTYQxZR - Region server log during the same
> interval
>
> Versions:
> HBase 0.92.1, r1298924
> Hadoop 0.20.2-cdh3u3
>
> ----------------------------------------------------------------------
> Ron Buckley
> x6365
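For anyone following along: J-D's first ask is the GC log, and a minimal sketch of enabling it is below. This goes in hbase-env.sh; the log path here is an example, not a path from Ron's cluster, and the flags assume a HotSpot JVM of that era (Java 6).

```shell
# hbase-env.sh (sketch): enable GC logging on the region servers so a
# pause spanning the 7-minute gap above would show up with timestamps.
# /var/log/hbase/gc-hbase.log is an example path - pick one that exists.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-hbase.log"
```

The session timeout J-D mentions is `zookeeper.session.timeout` in hbase-site.xml, in milliseconds (180000, i.e. 3 minutes, is the 0.92 default); note the ZooKeeper server's own `maxSessionTimeout` must also be raised for a larger client value to take effect.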
