Sorry for omitting the background. We assume a user is more interested in his latest bills than in his old ones. Thus, the query generator works as follows (a rough sketch in Java is below):
1. randomly generate a number and reverse its digits to use as the user id.
2. randomly generate a month, weighted toward recent months per the above assumption.
3. ask HBase to query this user + month.
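A minimal sketch of such a generator, assuming a 0.90-era HBase client API; the row-key layout (userId + "-" + month), the table name "bills", and the squared-uniform weighting toward recent months are illustrative assumptions rather than our exact implementation:

    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BillQueryGenerator {
        private static final int NUM_USERS = 1000000;  // hypothetical user population
        private static final int NUM_MONTHS = 24;      // hypothetical billing history depth
        private final Random rand = new Random();

        // Step 1: random number, digits reversed, used as the user id.
        String randomUserId() {
            int n = rand.nextInt(NUM_USERS);
            return new StringBuilder(String.format("%07d", n)).reverse().toString();
        }

        // Step 2: month index skewed toward 0 (the latest month).
        int weightedMonth() {
            double u = rand.nextDouble();
            return (int) (u * u * NUM_MONTHS);  // squaring biases the draw toward recent months
        }

        // Step 3: ask HBase for that user + month (row-key layout is an assumption).
        Result queryBill(HTable table) throws Exception {
            String rowKey = randomUserId() + "-" + weightedMonth();
            Get get = new Get(Bytes.toBytes(rowKey));
            return table.get(get);
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "bills");  // table name is hypothetical
            System.out.println(new BillQueryGenerator().queryBill(table));
            table.close();
        }
    }

Squaring a uniform draw is just one way to bias the month toward the latest bill; any skewed distribution produces the same hot-spot effect on the keys for recent months.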
Thanks
Weihua

2011/5/20 Matt Corgan <[email protected]>:
> I think i traced this to a bug in my compaction scheduler that would have missed scheduling about half the regions, hence the 240gb vs 480gb. To confirm: major compaction will always run when asked, even if the region is already major compacted, the table settings haven't changed, and it was last major compacted on that same server. [potential hbase optimization here for clusters with many cold regions]. So my theory about not localizing blocks is false.
>
> Weihua - why do you think your throughput doubled when you went from user+month to month+user keys? Are your queries using an even distribution of months? I'm not exactly clear on your schema or query pattern.
>
>
> On Thu, May 19, 2011 at 8:39 AM, Joey Echeverria <[email protected]> wrote:
>
>> I'm surprised the major compactions didn't balance the cluster better. I wonder if you've stumbled upon a bug in HBase that's causing it to leak old HFiles.
>>
>> Is the total amount of data in HDFS what you expect?
>>
>> -Joey
>>
>> On Thu, May 19, 2011 at 8:35 AM, Matt Corgan <[email protected]> wrote:
>> > that's right
>> >
>> >
>> > On Thu, May 19, 2011 at 8:23 AM, Joey Echeverria <[email protected]> wrote:
>> >
>> >> Am I right to assume that all of your data is in HBase, ie you don't keep anything in just HDFS files?
>> >>
>> >> -Joey
>> >>
>> >> On Thu, May 19, 2011 at 8:15 AM, Matt Corgan <[email protected]> wrote:
>> >> > I wanted to do some more investigation before posting to the list, but it seems relevant to this conversation...
>> >> >
>> >> > Is it possible that major compactions don't always localize the data blocks? Our cluster had a bunch of regions full of historical analytics data that were already major compacted, then we added a new datanode/regionserver. We have a job that triggers major compactions at a minimum of once per week by hashing the region name and giving it a time slot. It's been several weeks and the original nodes each have ~480gb used in hdfs, while the new node has only 240gb. Regions are scattered pretty randomly and evenly among the regionservers.
>> >> >
>> >> > The job calls hBaseAdmin.majorCompact(hRegionInfo.getRegionName());
>> >> >
>> >> > My guess is that if a region is already major compacted and no new data has been added to it, then compaction is skipped. That's definitely an essential feature during typical operation, but it's a problem if you're relying on major compaction to balance the cluster.
>> >> >
>> >> > Matt
>> >> >
>> >> >
>> >> > On Thu, May 19, 2011 at 4:42 AM, Michel Segel <[email protected]> wrote:
>> >> >
>> >> >> I had asked the question about how he created random keys... Hadn't seen a response.
>> >> >>
>> >> >> Sent from a remote device. Please excuse any typos...
>> >> >>
>> >> >> Mike Segel
>> >> >>
>> >> >> On May 18, 2011, at 11:27 PM, Stack <[email protected]> wrote:
>> >> >>
>> >> >> > On Wed, May 18, 2011 at 5:11 PM, Weihua JIANG <[email protected]> wrote:
>> >> >> >> All the DNs almost have the same number of blocks. Major compaction makes no difference.
>> >> >> >>
>> >> >> >
>> >> >> > I would expect major compaction to even the number of blocks across the cluster and it'd move the data for each region local to the regionserver.
>> >> >> >
>> >> >> > The only explanation that I can see is that the hot DNs must be carrying the hot blocks (The client querys are not random). I do not know what else it could be.
>> >> >> >
>> >> >> > St.Ack
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >> --
>> >> Joseph Echeverria
>> >> Cloudera, Inc.
>> >> 443.305.9434
>> >>
>> >
>>
>> --
>> Joseph Echeverria
>> Cloudera, Inc.
>> 443.305.9434
>>
>
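For reference, a minimal sketch of the kind of weekly compaction job Matt describes above (hash each region name into a time slot, then call hBaseAdmin.majorCompact()), assuming the 0.90-era HBaseAdmin/HTable client API; the hourly slot granularity and the hashing scheme are illustrative assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WeeklyMajorCompactionJob {
        // 168 hourly slots per week; a region is compacted only during its own slot.
        private static final int SLOTS_PER_WEEK = 7 * 24;

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTable table = new HTable(conf, args[0]);  // table name passed on the command line

            long hoursSinceEpoch = System.currentTimeMillis() / (60L * 60 * 1000);
            int currentSlot = (int) (hoursSinceEpoch % SLOTS_PER_WEEK);

            for (HRegionInfo hRegionInfo : table.getRegionsInfo().keySet()) {
                // Hash the region name into a stable weekly time slot.
                int slot = (Bytes.toString(hRegionInfo.getRegionName()).hashCode()
                            & Integer.MAX_VALUE) % SLOTS_PER_WEEK;
                if (slot == currentSlot) {
                    admin.majorCompact(hRegionInfo.getRegionName());
                }
            }
            table.close();
        }
    }

Run hourly (e.g. from cron), each region lands in exactly one slot per week, so every region gets major compacted at least weekly without compacting the whole table at once.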
