We ran into this situation today on our production instance, and I'm wondering if anyone has seen it or knows a way around it.
We were repeatedly updating the same 60,000-column-wide row, and with each update we rewrote nearly every column of the row. Each update generated nearly 9 MB in the WAL.

This worked OK most of the time. We frequently saw messages like this in the region server log, indicating a 2.6 second flush time:

2012-08-06 11:42:52,316 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~155.4m/162931400, currentsize=2.7k/2760 for region Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in 2610ms, sequenceid=1792263953, compaction requested=false

However, sometimes we would see flush times of around 8-10 minutes:

2012-08-06 11:51:28,139 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~155.4m/162905016, currentsize=116.5m/122172880 for region Works,73658752:eng,1337915462169.1cfc92481a74ec278337871563d189d6. in 501337ms, sequenceid=1792264092, compaction requested=true

During these long flushes, we would end up using all of the IPC handler slots we had configured for that region server (50), so the region server looked unavailable to our application. After the long delay, the region server would continue running fine, only to do the same thing again a few minutes later.

I thought it was most likely something to do with garbage collection, so I took a stack trace during one of the events. The trace doesn't look like GC was running.

http://pastebin.com/ANBiUzX4 - Stack trace taken while the region server was unreachable
http://pastebin.com/naTYQxZR - Region server log covering the same interval

Versions:
HBase 0.92.1, r1298924
Hadoop 0.20.2-cdh3u3

----------------------------------------------------------------------
Ron Buckley x6365
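
P.S. For anyone who wants to reproduce the write pattern, it looks roughly like the sketch below (HBase 0.92 client API; the row key, column family, and qualifier names here are placeholders, not our actual schema):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WideRowUpdate {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "Works");

        // One row with ~60,000 qualifiers; each pass rewrites nearly all of
        // them, so a single Put is on the order of 9 MB in the WAL/memstore.
        Put put = new Put(Bytes.toBytes("someRowKey"));
        for (int i = 0; i < 60000; i++) {
          put.add(Bytes.toBytes("cf"), Bytes.toBytes("col" + i),
                  Bytes.toBytes("value-" + i));
        }
        table.put(put);
        table.close();
      }
    }

This loop runs repeatedly against the same row, which is what drives the large memstore flushes described above.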
