When I ran the test, I found that the compaction thread pushed one CPU core to 100% while the other cores were not fully utilized. I checked the HBase code and found that a single thread does all compactions. That needs to change. The 0.21 version of HBase does plan to use multiple compaction threads, but it appears it will take a long time for us to get there.

Another issue I found with HBase is that when the number of regions reaches around 1000 per regionserver, the regionserver begins to shut itself down for various reasons: most of the time an IO issue with a datanode, occasionally a session-expiration issue with ZooKeeper. This is true regardless of what key I use for the table.
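(One knob that can keep the per-server region count down is the region split size in hbase-site.xml. The value below is purely illustrative, not a recommendation:

```xml
<property>
  <name>hbase.hregion.max.filesize</name>
  <!-- A region splits once a store file exceeds this size; raising it
       from the 0.20 default of 256MB yields fewer, larger regions
       per regionserver. Illustrative value: 1GB. -->
  <value>1073741824</value>
</property>
```

Fewer regions also means fewer store files competing for the single compaction thread.)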

Jimmy.

--------------------------------------------------
From: "Jeff Whiting" <[email protected]>
Sent: Friday, September 10, 2010 9:44 AM
To: <[email protected]>
Subject: Re: ycsb test on hbase

We were having the exact same problem when we were doing our own load testing with HBase. We found that a memstore would reach its hbase.hstore.blockingStoreFiles limit or its hbase.hregion.memstore.block.multiplier limit. Hitting either of those limits blocks writes to that specific region, and the client has to pause until a compaction can come through and clean stuff up.
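(For reference, both thresholds live in hbase-site.xml. A sketch of the kind of tuning being discussed; the values here are illustrative, not recommendations:

```xml
<property>
  <name>hbase.hstore.blockingStoreFiles</name>
  <!-- Block updates to a region once one of its stores has this many
       StoreFiles awaiting compaction (0.20 default is 7). -->
  <value>15</value>
</property>
<property>
  <name>hbase.hregion.memstore.block.multiplier</name>
  <!-- Block updates once the memstore grows to multiplier times the
       configured flush size (default multiplier is 2). -->
  <value>4</value>
</property>
```

Raising either limit trades longer pauses later for fewer blocks now, which is why the priority-queue fix below matters.)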

However, the biggest problem was that there would be a decent-sized compaction queue; we'd hit one of those limits, get put on the *back* of the queue, and have to wait *minutes* before the compaction we needed to stop the blocking finally ran. I created a JIRA to address the issue: HBASE-2646. There is a patch in the JIRA for 0.20.4 that creates a priority compaction queue, which greatly helped our problem; in fact, we saw little to no pausing after applying the patch. In the comments of the JIRA you can see some of the settings we used to mitigate the problem without the patch.
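(The idea behind a priority compaction queue can be sketched in a few lines of Java. This is only an illustration of the ordering trick, not the actual HBASE-2646 patch; the names CompactionRequest, PRIORITY_BLOCKED, and PRIORITY_NORMAL are made up for this example:

```java
import java.util.concurrent.PriorityBlockingQueue;

public class PriorityCompactionQueueSketch {

    static final int PRIORITY_BLOCKED = 0;  // region is blocking writes: jump the queue
    static final int PRIORITY_NORMAL = 10;  // routine background compaction

    // One queued compaction; lower priority value means it runs sooner.
    static class CompactionRequest implements Comparable<CompactionRequest> {
        final String region;
        final int priority;

        CompactionRequest(String region, int priority) {
            this.region = region;
            this.priority = priority;
        }

        @Override
        public int compareTo(CompactionRequest other) {
            return Integer.compare(this.priority, other.priority);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // PriorityBlockingQueue orders by compareTo instead of FIFO.
        PriorityBlockingQueue<CompactionRequest> queue = new PriorityBlockingQueue<>();
        queue.put(new CompactionRequest("region-a", PRIORITY_NORMAL));
        queue.put(new CompactionRequest("region-b", PRIORITY_NORMAL));
        // region-c just hit blockingStoreFiles: it goes to the front, not the back.
        queue.put(new CompactionRequest("region-c", PRIORITY_BLOCKED));

        System.out.println(queue.take().region); // prints "region-c"
    }
}
```

With a plain FIFO queue, region-c would wait behind the two background requests; with the priority ordering it is compacted first, which is exactly the blocking scenario the patch fixes.)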

Apparently there is some work going on to do concurrent priority compactions (Jonathan Gray has been working on it), but I haven't seen anything in HBase yet and don't know the timeline. My personal opinion is that we should integrate the patch into trunk and use it until the more advanced compactions are implemented.

~Jeff

On 9/10/2010 2:27 AM, Jeff Hammerbacher wrote:
We've been brainstorming some ideas to "smooth out" these performance
lapses, so instead of getting a 10 second period of unavailability, you get
a 30 second period of slower performance, which is usually preferable.

Where is this brainstorming taking place? Could we open a JIRA issue to
capture the brainstorming in public and searchable fashion?


--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]
