When I ran the test, I found the compaction thread pegged its CPU core at
100% while the other cores in the OS were not fully utilized. I checked the
HBase code and found there is only a single thread doing compaction. That
needs to change. I found that the 0.21 version of HBase does plan to use
multiple threads for compaction, but it appears it will take a long time
for us to get there.
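To illustrate what I mean, here is a rough sketch of compactions spread
across a fixed thread pool. The Region type is just a hypothetical
stand-in, not HBase's actual internals:

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  public class ParallelCompactor {
    // Hypothetical stand-in for a region that needs compaction.
    public interface Region {
      void compact();
    }

    private final ExecutorService pool;

    public ParallelCompactor(int threads) {
      // Several workers instead of one compaction thread, so one big
      // compaction cannot pin a single core while the rest sit idle.
      this.pool = Executors.newFixedThreadPool(threads);
    }

    public void submitAll(List<Region> regions) {
      for (final Region region : regions) {
        pool.submit(new Runnable() {
          public void run() {
            region.compact();
          }
        });
      }
    }

    public void shutdown() throws InterruptedException {
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
    }
  }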
Another issue I found with HBase is that when the number of regions
reaches around 1000 per regionserver, the regionservers begin to shut down
on their own for various reasons: most of the time an I/O issue with a
datanode, occasionally a session expiration with ZooKeeper. This is true
regardless of what key I use for the table.
Jimmy.
--------------------------------------------------
From: "Jeff Whiting" <[email protected]>
Sent: Friday, September 10, 2010 9:44 AM
To: <[email protected]>
Subject: Re: ycsb test on hbase
We were having the exact same problem when we were doing our own load
testing with HBase. We found that a region would hit its
hbase.hstore.blockingStoreFiles limit or its
hbase.hregion.memstore.block.multiplier limit. Hitting either of those
limits blocks writes to that specific region, and the client has to pause
until a compaction can come through and clean things up.
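To make the two limits concrete, here is a rough sketch, in plain Java, of
the kind of check that gates writes to a region. This is not HBase's
actual code, and the default values are from memory, so verify them
against your version:

  // Rough sketch only, not HBase's real code; defaults from memory.
  class RegionWriteGate {
    int storeFileCount;   // store files currently in the region
    long memstoreSize;    // current memstore size in bytes
    long memstoreFlushSize = 64L * 1024 * 1024;  // hbase.hregion.memstore.flush.size
    int blockingStoreFiles = 7;   // hbase.hstore.blockingStoreFiles
    int blockMultiplier = 2;      // hbase.hregion.memstore.block.multiplier

    boolean writesBlocked() {
      // Too many uncompacted store files: block until a compaction runs.
      if (storeFileCount >= blockingStoreFiles) {
        return true;
      }
      // Memstore outgrew flush size * multiplier: block until a flush.
      return memstoreSize >= memstoreFlushSize * blockMultiplier;
    }
  }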
However, the biggest problem was that there would be a decent-sized
compaction queue: we'd hit one of those limits, get put on the *back* of
the queue, and have to wait *minutes* before the compaction we needed to
stop the blocking finally ran. I created a JIRA, HBASE-2646, to address
the issue. There is a patch on the JIRA for 0.20.4 that creates a priority
compaction queue, and it greatly helped our problem; in fact we saw little
to no pausing after applying the patch. In the comments of the JIRA you
can see some of the settings we used to mitigate the problem without the
patch.
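The actual patch is on the JIRA; purely as a sketch of the idea (not the
patch's code), a priority queue lets the compaction for a blocking region
jump ahead of routine work:

  import java.util.concurrent.PriorityBlockingQueue;

  public class PriorityCompactionQueue {
    // Lower value = more urgent. A region that is blocking writes
    // gets priority 0; routine compactions get priority 1.
    public static class CompactionRequest
        implements Comparable<CompactionRequest> {
      final String regionName;
      final int priority;

      public CompactionRequest(String regionName, int priority) {
        this.regionName = regionName;
        this.priority = priority;
      }

      public int compareTo(CompactionRequest other) {
        return this.priority - other.priority;
      }
    }

    private final PriorityBlockingQueue<CompactionRequest> queue =
        new PriorityBlockingQueue<CompactionRequest>();

    public void add(String regionName, boolean blockingWrites) {
      queue.add(new CompactionRequest(regionName, blockingWrites ? 0 : 1));
    }

    // Called by the compaction worker: blocking regions come out first
    // instead of waiting at the back of a FIFO queue.
    public CompactionRequest take() throws InterruptedException {
      return queue.take();
    }
  }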
Apparently there is some work going on to do concurrent priority
compaction (Jonathan Gray has been working on it), but I haven't seen
anything in HBase yet and don't know the timeline. My personal opinion is
that we should integrate the patch into trunk and use it until the more
advanced compactions are implemented.
~Jeff
On 9/10/2010 2:27 AM, Jeff Hammerbacher wrote:
We've been brainstorming some ideas to "smooth out" these performance
lapses, so instead of getting a 10-second period of unavailability, you
get a 30-second period of slower performance, which is usually preferable.

Where is this brainstorming taking place? Could we open a JIRA issue to
capture the brainstorming in a public and searchable fashion?
--
Jeff Whiting
Qualtrics Senior Software Engineer
[email protected]