Hi William,

Phoenix uses this "bucket mod" solution as well (http://phoenix.incubator.apache.org/salted.html). For a scan, you have to run it once for every possible bucket. You can still do a range scan; you just have to prepend the bucket number to the start/stop key of each scan you do, and then do a merge sort on the results. Phoenix does all of this transparently for you.

Thanks,
James
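For reference, here is a minimal sketch of that per-bucket fan-out using the plain HBase client API of the era. The table name ("trades"), the bucket count, and the single-byte salt layout are assumptions for illustration; Phoenix does the equivalent transparently, including the merge sort, which this sketch replaces with simple concatenation.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRangeScan {
    private static final int NUM_BUCKETS = 200; // assumption, matching the thread's example

    // Runs one range scan per salt bucket, prepending the bucket byte to
    // both the start and stop keys, and concatenates the results.
    public static List<Result> scanRange(byte[] startKey, byte[] stopKey)
            throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "trades"); // hypothetical table name
        List<Result> results = new ArrayList<Result>();
        try {
            for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
                byte[] salt = new byte[] { (byte) bucket };
                Scan scan = new Scan(Bytes.add(salt, startKey),
                                     Bytes.add(salt, stopKey));
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        results.add(r); // a real client would merge-sort the bucket streams
                    }
                } finally {
                    scanner.close();
                }
            }
        } finally {
            table.close();
        }
        return results;
    }
}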
On Mon, Jan 20, 2014 at 4:51 PM, William Kang <[email protected]> wrote:

> Hi,
> Thank you guys. This is an informative email chain.
>
> I have one follow-up question about using the "bucket mod" solution. Once
> you add the bucket number as the prefix to the key, how do you retrieve
> the rows? Do you have to use a RowFilter? Will there be any performance
> issue with using the row filter, since it seems that would require a full
> table scan?
>
> Many thanks.
>
> William
>
> On Mon, Jan 20, 2014 at 5:06 AM, Amit Sela <[email protected]> wrote:
>
> > The number of scans depends on the number of regions a day's data uses.
> > You need to manage compaction and splitting manually.
> > If a day's data is 100MB and you want regions to be no more than 200MB,
> > then it's two regions to scan per day; if it's 1GB, then 10, etc.
> > Compression will help you maximize the data per region and, as I've
> > recently learned, if your key occupies most of the bytes in the KeyValue
> > (the key is longer than the family, qualifier and value), then
> > compression can be very efficient; I have a case where 100GB is
> > compressed to 7GB.
> >
> > On Mon, Jan 20, 2014 at 6:56 AM, Vladimir Rodionov
> > <[email protected]> wrote:
> >
> > > Ted, how does it differ from row key salting?
> > >
> > > Best regards,
> > > Vladimir Rodionov
> > > Principal Platform Engineer
> > > Carrier IQ, www.carrieriq.com
> > > e-mail: [email protected]
> > >
> > > ________________________________________
> > > From: Ted Yu [[email protected]]
> > > Sent: Sunday, January 19, 2014 6:53 PM
> > > To: [email protected]
> > > Subject: Re: HBase load distribution vs. scan efficiency
> > >
> > > Bill:
> > > See
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >
> > > FYI
> > >
> > > On Sun, Jan 19, 2014 at 4:02 PM, Bill Q <[email protected]> wrote:
> > >
> > > > Hi Amit,
> > > > Thanks for the reply.
> > > >
> > > > If I understand your suggestion correctly, and assuming we have 100
> > > > region servers, I would have to do 100 scans and merge the reads if
> > > > I want to pull any data for a specific date. Is that correct? Are
> > > > 100 scans the most efficient way to deal with this issue?
> > > >
> > > > Any thoughts?
> > > >
> > > > Many thanks.
> > > >
> > > > Bill
> > > >
> > > > On Sun, Jan 19, 2014 at 4:02 PM, Amit Sela <[email protected]> wrote:
> > > >
> > > > > If you use bulk load to insert your data, you could use the date
> > > > > as the key prefix and choose the rest of the key in a way that
> > > > > splits each day evenly. You'll have X regions for every day, i.e.
> > > > > 14X regions for the two-week window.
> > > > >
> > > > > On Jan 19, 2014 8:39 PM, "Bill Q" <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > > I am designing a schema to host a large volume of data in HBase.
> > > > > > We collect daily trading data for some markets, and we run a
> > > > > > moving-window analysis to make predictions based on a two-week
> > > > > > window.
> > > > > >
> > > > > > Since everybody is going to pull the latest two weeks of data
> > > > > > every day, putting the date in the lead position of the key
> > > > > > would give us some hot regions. So we can use a bucketing
> > > > > > approach (date mod bucket number) to deal with this situation.
> > > > > > However, if we have 200 buckets, we need to run 200 scans to
> > > > > > extract all the data from the last two weeks.
> > > > > >
> > > > > > My questions are:
> > > > > > 1. What happens when each scan returns its results? Will the
> > > > > > scan results be sent to a sink-like place that collects and
> > > > > > concatenates all the scan results?
> > > > > > 2. Why might having 200 scans be a bad thing compared to having
> > > > > > only 10 scans?
> > > > > > 3. Any suggestions for the design?
> > > > > >
> > > > > > Many thanks.
> > > > > >
> > > > > > Bill
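As a footnote to the thread: the write-side half of the "bucket mod" scheme might look like the sketch below, where the bucket is derived from a hash of the unsalted key, mod the bucket count, and prepended as a single byte (the style used by the HBaseWD post Ted linked). The class and method names are illustrative assumptions, and the bucket count must stay fixed once data has been written, or existing rows become unreachable by prefix.

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
    private static final int NUM_BUCKETS = 200; // must never change after data is written

    // Prepends a deterministic single-byte bucket prefix to the row key.
    public static byte[] salt(byte[] rowKey) {
        // Mask the sign bit so the modulo result is never negative.
        int bucket = (Bytes.hashCode(rowKey) & Integer.MAX_VALUE) % NUM_BUCKETS;
        return Bytes.add(new byte[] { (byte) bucket }, rowKey);
    }
}

Because the bucket is a pure function of the key, a point Get needs only one lookup with the recomputed prefix; a date-range read, by contrast, has to fan out across all buckets, as in the scan sketch earlier in this message.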
