Re: multiple partial scans in the row

James Young Tue, 14 Feb 2012 18:31:03 -0800

Thank you Ian! Yes, the orderIds are ordered.

I might try the timestamp filter, but it still doesn't provide an
early-out, so I'm not sure how well it would perform. Do you think it
might be worth writing a custom filter to do the two partial scans?
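
To make the early-out idea concrete, here is a minimal sketch of the
skipping logic such a filter would need. It is plain Python over a sorted
list of keys, not HBase API, and it assumes things the thread does not
state: fixed-width, zero-padded orderIds and timestamps (so string order
matches numeric order) and "-" as the only separator. The function name
and sample keys are made up for illustration.

```python
from bisect import bisect_left

def skip_scan(sorted_keys, first_order, last_order, ts_start, ts_end):
    """Collect keys whose orderId is in [first_order, last_order] and whose
    timestamp is in [ts_start, ts_end], seeking past rows that cannot match.
    Returns (matching_keys, number_of_keys_examined)."""
    matches, examined = [], 0
    # Start at the first possible match: lowest orderId, start of the window.
    i = bisect_left(sorted_keys, f"{first_order}-{ts_start}")
    while i < len(sorted_keys):
        key = sorted_keys[i]
        examined += 1
        order_id, ts, _item = key.split("-", 2)
        if order_id > last_order:
            break  # early out: nothing after this row can match
        if ts < ts_start:
            # Before the window: seek to the window start for this orderId.
            i = bisect_left(sorted_keys, f"{order_id}-{ts_start}", i + 1)
        elif ts > ts_end:
            # After the window: seek to the next orderId.
            # "~" sorts after "-", so this lands past every row of order_id.
            i = bisect_left(sorted_keys, f"{order_id}~", i + 1)
        else:
            matches.append(key)
            i += 1
    return matches, examined

# Hypothetical sample rows in orderId-timeStamp-item form.
keys = sorted([
    "001-20120101-a", "001-20120210-b", "001-20120301-c",
    "002-20120105-a", "002-20120212-b", "002-20120214-c",
    "002-20120401-d", "003-20120211-a", "004-20120213-a",
])
found, examined = skip_scan(keys, "001", "003", "20120201", "20120228")
```

On a table this small the savings are negligible; the point is only that
each orderId costs at most a couple of seeks beyond its matching rows,
rather than a read of every row it has.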


Thanks again.
James

On Wed, Feb 15, 2012 at 2:01 AM, Ian Varley wrote:
> James,
>
> Are your orderIds ordered? You say "a range of orderIds", which implies that
> they are (i.e. they're sequential numbers like 001, 002, etc., not hashes or
> random values). If so, then a single scan can hit the rows for multiple
> contiguous orderIds (you'd set the start and stop rows based on a prefix of
> the row key that's just the length of the orderId).
>
> Another question: are the time ranges you're scanning a big or small 
> proportion of all the rows for each order id? If you generally expect to 
> return a majority of the rows for each order, then a single scan (starting 
> with the lowest orderId, and proceeding to the highest) is possibly still a 
> good fit. You can also apply timestamp filters (which enables an optimization 
> to exclude storefiles that couldn't possibly contain values in that timestamp 
> range); that only works if the timestamps on your cells match the timestamp 
> in the row key.
>
> Alternately, if you expect to return only a small portion of the records 
> (i.e. you keep a lot of items with a wide range of timestamps in each 
> orderId, but you only want to retrieve a small set of them), you might want 
> to do one scan per orderId. You can choose how much parallelism to put into 
> it by controlling that yourself (i.e. use a thread per scan on the client 
> side); you could theoretically do a thread per order id, but of course, if 
> you have a very large number of them, that could be harmful.
>
> A regular expression doesn't get you past the fundamental requirement, which 
> is that at the server side, it has to look at every row (excepting special 
> optimizations like the timestamp one I mentioned above).
>
> Your best bet is to implement it a couple ways, with real data, and see which 
> ones seem to work the fastest.
>
> Ian
>
> On Feb 14, 2012, at 11:45 AM, James Young wrote:
>
> Hi there,
>
> I am pretty new to HBase and I am trying to understand the best
> practice for doing a scan based on two (or more) partial scans of the
> row key.
>
> For example, I have a row key like: orderId-timeStamp-item. The
> orderId has nothing to do with the timeStamp, and I have a requirement
> to scan rows for certain orderIds (a range of orderIds) within a certain
> time period. I am not sure if it is possible to perform two partial
> scans: one for the orderId and another for the timeStamp.
>
> Also, using a regular expression on the row key might work, but it
> is more expensive, so I am wondering what the best practice would be
> for solving such a problem.
>
>
> Thanks in advance,
>
> James
>
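
For comparison, the two layouts Ian describes in the quoted message (one
scan across the whole orderId range, versus one scan per orderId run on
client-side threads) can be sketched in plain Python over a sorted list of
keys. This is a simulation, not HBase API: the helper names and sample
keys are hypothetical, and it assumes fixed-width, zero-padded orderIds
and timestamps so that string order matches numeric order.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

def range_scan(sorted_keys, start_row, stop_row):
    """Keys k with start_row <= k < stop_row (stop row exclusive)."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]

def in_window(key, ts_start, ts_end):
    """True if the key's timestamp field lies in [ts_start, ts_end]."""
    ts = key.split("-", 2)[1]
    return ts_start <= ts <= ts_end

def single_scan(sorted_keys, first_order, last_order, ts_start, ts_end):
    # One scan bounded only by the orderId prefix; every row in the
    # range is read and the time window is checked row by row.
    rows = range_scan(sorted_keys, f"{first_order}-", f"{last_order}~")
    return [k for k in rows if in_window(k, ts_start, ts_end)]

def per_order_scans(sorted_keys, order_ids, ts_start, ts_end, threads=4):
    def scan_one(order_id):
        # Start and stop rows carry both the orderId and the time window.
        # "~" sorts after "-", so rows stamped exactly ts_end are included.
        return range_scan(sorted_keys,
                          f"{order_id}-{ts_start}",
                          f"{order_id}-{ts_end}~")
    # One small scan per orderId; the pool size caps the parallelism.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        parts = list(pool.map(scan_one, order_ids))
    return [k for part in parts for k in part]

# Hypothetical sample rows in orderId-timeStamp-item form.
rows = sorted([
    "001-20120101-a", "001-20120210-b", "001-20120301-c",
    "002-20120105-a", "002-20120212-b", "002-20120214-c",
    "002-20120401-d", "003-20120211-a", "004-20120213-a",
])
one_pass = single_scan(rows, "001", "003", "20120201", "20120228")
fan_out = per_order_scans(rows, ["001", "002", "003"], "20120201", "20120228")
```

Both calls return the same rows. The single scan reads every row in the
orderId range, while the per-order version reads only rows inside each
time window at the cost of one scan per orderId, which is the trade-off
to measure with real data.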
