Hi William,

Phoenix uses this "bucket mod" solution as well (http://phoenix.incubator.apache.org/salted.html). For a scan, you have to run it once for every possible bucket. You can still do a range scan; you just have to prepend the bucket number to the start/stop key of each scan you do, and then do a merge sort on the results. Phoenix does all of this transparently for you.

Thanks,
James
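For reference, here is a minimal sketch of that per-bucket fan-out using the plain HBase client API of the era. The table name ("trades"), the bucket count, and the single-byte salt layout are assumptions for illustration; Phoenix does the equivalent transparently, including the merge sort, which this sketch replaces with simple concatenation.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class SaltedRangeScan {
    private static final int NUM_BUCKETS = 200; // assumption, matching the thread's example

    // Runs one range scan per salt bucket, prepending the bucket byte to
    // both the start and stop keys, and concatenates the results.
    public static List<Result> scanRange(byte[] startKey, byte[] stopKey)
            throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "trades"); // hypothetical table name
        List<Result> results = new ArrayList<Result>();
        try {
            for (int bucket = 0; bucket < NUM_BUCKETS; bucket++) {
                byte[] salt = new byte[] { (byte) bucket };
                Scan scan = new Scan(Bytes.add(salt, startKey),
                                     Bytes.add(salt, stopKey));
                ResultScanner scanner = table.getScanner(scan);
                try {
                    for (Result r : scanner) {
                        results.add(r); // a real client would merge-sort the bucket streams
                    }
                } finally {
                    scanner.close();
                }
            }
        } finally {
            table.close();
        }
        return results;
    }
}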
On Mon, Jan 20, 2014 at 4:51 PM, William Kang <[email protected]> wrote:

> Hi,
> Thank you guys. This is an informative email chain.
>
> I have one follow-up question about using the "bucket mod" solution. Once
> you add the bucket number as the prefix to the key, how do you retrieve
> the rows? Do you have to use a RowFilter? Will there be any performance
> issue with using the row filter, since it seems that would require a full
> table scan?
>
> Many thanks.
>
> William
>
> On Mon, Jan 20, 2014 at 5:06 AM, Amit Sela <[email protected]> wrote:
>
> > The number of scans depends on the number of regions a day's data uses.
> > You need to manage compaction and splitting manually.
> > If a day's data is 100MB and you want regions to be no more than 200MB,
> > then it's two regions to scan per day; if it's 1GB, then 10, etc.
> > Compression will help you maximize the data per region and, as I've
> > recently learned, if your key occupies most of the bytes in the KeyValue
> > (the key is longer than the family, qualifier and value), then
> > compression can be very efficient; I have a case where 100GB is
> > compressed to 7GB.
> >
> > On Mon, Jan 20, 2014 at 6:56 AM, Vladimir Rodionov
> > <[email protected]> wrote:
> >
> > > Ted, how does it differ from row key salting?
> > >
> > > Best regards,
> > > Vladimir Rodionov
> > > Principal Platform Engineer
> > > Carrier IQ, www.carrieriq.com
> > > e-mail: [email protected]
> > >
> > > ________________________________________
> > > From: Ted Yu [[email protected]]
> > > Sent: Sunday, January 19, 2014 6:53 PM
> > > To: [email protected]
> > > Subject: Re: HBase load distribution vs. scan efficiency
> > >
> > > Bill:
> > > See
> > > http://blog.sematext.com/2012/04/09/hbasewd-avoid-regionserver-hotspotting-despite-writing-records-with-sequential-keys/
> > >
> > > FYI
> > >
> > > On Sun, Jan 19, 2014 at 4:02 PM, Bill Q <[email protected]> wrote:
> > >
> > > > Hi Amit,
> > > > Thanks for the reply.
> > > >
> > > > If I understand your suggestion correctly, and assuming we have 100
> > > > region servers, I would have to do 100 scans and merge the reads if
> > > > I want to pull any data for a specific date. Is that correct? Are
> > > > 100 scans the most efficient way to deal with this issue?
> > > >
> > > > Any thoughts?
> > > >
> > > > Many thanks.
> > > >
> > > > Bill
> > > >
> > > > On Sun, Jan 19, 2014 at 4:02 PM, Amit Sela <[email protected]> wrote:
> > > >
> > > > > If you use bulk load to insert your data, you could use the date
> > > > > as the key prefix and choose the rest of the key in a way that
> > > > > splits each day evenly. You'll have X regions for every day, i.e.
> > > > > 14X regions for the two-week window.
> > > > >
> > > > > On Jan 19, 2014 8:39 PM, "Bill Q" <[email protected]> wrote:
> > > > >
> > > > > > Hi,
> > > > > > I am designing a schema to host a large volume of data in HBase.
> > > > > > We collect daily trading data for some markets, and we run a
> > > > > > moving-window analysis to make predictions based on a two-week
> > > > > > window.
> > > > > >
> > > > > > Since everybody is going to pull the latest two weeks of data
> > > > > > every day, putting the date in the lead position of the key
> > > > > > would give us some hot regions. So we can use a bucketing
> > > > > > approach (date mod bucket number) to deal with this situation.
> > > > > > However, if we have 200 buckets, we need to run 200 scans to
> > > > > > extract all the data from the last two weeks.
> > > > > >
> > > > > > My questions are:
> > > > > > 1. What happens when each scan returns its results? Will the
> > > > > > scan results be sent to a sink-like place that collects and
> > > > > > concatenates all the scan results?
> > > > > > 2. Why might having 200 scans be a bad thing compared to having
> > > > > > only 10 scans?
> > > > > > 3. Any suggestions for the design?
> > > > > >
> > > > > > Many thanks.
> > > > > >
> > > > > > Bill
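As a footnote to the thread: the write-side half of the "bucket mod" scheme might look like the sketch below, where the bucket is derived from a hash of the unsalted key, mod the bucket count, and prepended as a single byte (the style used by the HBaseWD post Ted linked). The class and method names are illustrative assumptions, and the bucket count must stay fixed once data has been written, or existing rows become unreachable by prefix.

import org.apache.hadoop.hbase.util.Bytes;

public class SaltedKey {
    private static final int NUM_BUCKETS = 200; // must never change after data is written

    // Prepends a deterministic single-byte bucket prefix to the row key.
    public static byte[] salt(byte[] rowKey) {
        // Mask the sign bit so the modulo result is never negative.
        int bucket = (Bytes.hashCode(rowKey) & Integer.MAX_VALUE) % NUM_BUCKETS;
        return Bytes.add(new byte[] { (byte) bucket }, rowKey);
    }
}

Because the bucket is a pure function of the key, a point Get needs only one lookup with the recomputed prefix; a date-range read, by contrast, has to fan out across all buckets, as in the scan sketch earlier in this message.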
