If you'll use bulk load to insert your data, you could use the date as the key prefix and choose the rest of the key in a way that splits each day evenly. You'll have X regions for every day, hence 14X regions for the two-week window.
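Here's a minimal sketch of one such key, assuming rows are keyed by trading symbol within each day; the helper name, the symbol field, and the per-day region count X are assumptions, not anything settled in this thread:

    import java.nio.charset.StandardCharsets;

    public class RowKeys {
        // X: how many regions each day should be split into (an assumption).
        static final int REGIONS_PER_DAY = 10;

        // Key layout: <yyyyMMdd><bucket byte><symbol>. The date leads, so a
        // two-week window is one contiguous key range; the bucket byte spreads
        // each day's rows evenly over that day's X regions. Bulk load writes
        // HFiles directly, so the sequential date prefix causes no write hotspot.
        static byte[] rowKey(String yyyyMMdd, String symbol) {
            byte bucket = (byte) ((symbol.hashCode() & 0x7fffffff) % REGIONS_PER_DAY);
            byte[] date = yyyyMMdd.getBytes(StandardCharsets.UTF_8);
            byte[] sym = symbol.getBytes(StandardCharsets.UTF_8);
            byte[] key = new byte[date.length + 1 + sym.length];
            System.arraycopy(date, 0, key, 0, date.length);
            key[date.length] = bucket;
            System.arraycopy(sym, 0, key, date.length + 1, sym.length);
            return key;
        }
    }

Pre-split the table at each (date, bucket) boundary before the bulk load; a two-week read is then a single scan from the start date to the end date, spread across 14X regions.

On Jan 19, 2014 8:39 PM, "Bill Q" <[email protected]> wrote: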
> Hi,
> I am designing a schema to host a large volume of data on HBase. We
> collect daily trading data for some markets, and we run a moving-window
> analysis to make predictions based on a two-week window.
>
> Since everybody is going to pull the latest two weeks of data every day,
> if we put the date in the lead position of the key, we will have some hot
> regions. So we can use a bucketing approach (date mod bucket number) to
> deal with this situation. However, if we have 200 buckets, we need to run
> 200 scans to extract all the data from the last two weeks.
>
> My questions are:
> 1. What happens when each scan returns its result? Will the scan results
> be sent to a sink-like place that collects and concatenates them all?
> 2. Why might having 200 scans be a bad thing compared to having only 10
> scans?
> 3. Any suggestions for the design?
>
> Many thanks.
>
>
> Bill
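For reference, with the bucketed layout described in the quoted message (a key like <bucket><yyyyMMdd><symbol>; the table name and field layout here are assumptions, and this uses the newer Connection/Table client API), reading the two-week window means one scan per bucket. There is no server-side sink: each ResultScanner streams rows back to the client, and the client itself collects and concatenates the 200 result sets, roughly like this:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BucketedWindowRead {
        static final int BUCKETS = 200;  // the 200 buckets from the question

        // Scans [startDay, stopDay) in every bucket and concatenates the
        // results client-side; "trades" is an assumed table name.
        static List<Result> readWindow(Connection conn, String startDay, String stopDay)
                throws IOException {
            List<Result> all = new ArrayList<Result>();
            Table table = conn.getTable(TableName.valueOf("trades"));
            try {
                for (int b = 0; b < BUCKETS; b++) {
                    // 200 buckets fit in one byte; HBase orders keys as
                    // unsigned bytes, so the prefix sorts correctly.
                    byte[] bucket = new byte[] { (byte) b };
                    Scan scan = new Scan();
                    scan.setStartRow(Bytes.add(bucket, Bytes.toBytes(startDay)));
                    scan.setStopRow(Bytes.add(bucket, Bytes.toBytes(stopDay)));
                    ResultScanner rs = table.getScanner(scan);
                    try {
                        for (Result r : rs) {
                            all.add(r);
                        }
                    } finally {
                        rs.close();
                    }
                }
            } finally {
                table.close();
            }
            return all;
        }
    }

The cost of 200 scans versus 10 is mostly the extra per-scan setup and round trips; issuing them in parallel hides latency but not total work.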
