Hi, I am designing a schema to host a large volume of data in HBase. We collect daily trading data for several markets and run a moving-window analysis that makes predictions based on a two-week window.
Since everybody is going to pull the latest two weeks of data every day, putting the date in the leading position of the row key would give us hot regions. We can use a bucketing (salting) approach to deal with this: prefix each row key with a bucket number derived from a hash mod the number of buckets. However, if we have 200 buckets, we need to run 200 scans to extract all the data for the last two weeks. A sketch of the scheme I have in mind follows the questions.

My questions are:

1. What happens when each scan returns its result? Is there a sink-like place that collects and concatenates all the scan results?
2. Why might having 200 scans be a bad thing compared to having only 10?
3. Any suggestions for the design?
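For concreteness, here is roughly what I have in mind, sketched against the HBase 2.x Java client. The table name "trades", the three-digit bucket prefix, the yyyyMMdd date format, and deriving the bucket from the symbol's hash (so that one day's writes spread across all buckets) are just my assumptions for illustration, not settled decisions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class BucketedScanSketch {

    static final int NUM_BUCKETS = 200;                          // hypothetical bucket count
    static final TableName TABLE = TableName.valueOf("trades");  // hypothetical table name

    // Write-side key layout (assumption): 3-digit zero-padded bucket + yyyyMMdd + symbol.
    // Deriving the bucket from the symbol (not the date) means a single day's rows
    // are spread over all buckets, which is what forces one scan per bucket on read.
    static byte[] rowKey(String symbol, String yyyyMMdd) {
        int bucket = (symbol.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
        return Bytes.toBytes(String.format("%03d%s%s", bucket, yyyyMMdd, symbol));
    }

    // Read side: pull the window [startDate, stopDate) by issuing one ranged scan
    // per bucket and concatenating the results client-side. As far as I can tell,
    // HBase does not merge these for me; each Scan is an independent RPC
    // conversation with the region server(s) holding that bucket's key range.
    static List<Result> scanWindow(Connection conn, String startDate, String stopDate)
            throws IOException {
        List<Result> all = new ArrayList<>();
        try (Table table = conn.getTable(TABLE)) {
            for (int b = 0; b < NUM_BUCKETS; b++) {
                String prefix = String.format("%03d", b);
                Scan scan = new Scan()
                        .withStartRow(Bytes.toBytes(prefix + startDate))
                        .withStopRow(Bytes.toBytes(prefix + stopDate)); // stop row is exclusive
                scan.setCaching(500); // fetch rows in batches to reduce RPC round trips
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        all.add(r);
                    }
                }
            }
        }
        return all;
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            // Stop date is exclusive, so this covers Jan 1 through Jan 14: two weeks.
            List<Result> window = scanWindow(conn, "20240101", "20240115");
            System.out.println("rows in window: " + window.size());
        }
    }
}

The per-bucket loop above is exactly the part I am worried about: with 200 buckets it means 200 sequential scanner setups and teardowns for every daily job.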
Many thanks.
Bill