James,

Are your orderIds ordered? You say "a range of orderIds", which implies that 
they are (i.e. they're sequential numbers like 001, 002, etc., not hashes or 
random values). If so, then a single scan can hit the rows for multiple 
contiguous orderIds: you'd set the start and stop rows to prefixes of the row 
key that are just the length of the orderId.
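
As a rough sketch of that prefix idea (plain Python rather than the HBase 
client API), the start and stop rows for a contiguous run of orderIds could be 
computed like this. It assumes the orderIds are zero-padded to a fixed width 
and that the stop row is exclusive; the function name and the sample keys are 
made up for illustration.

```python
def order_range_bounds(first_order, last_order, width=3):
    """Return (start_row, stop_row) for keys whose orderId is in
    [first_order, last_order], assuming fixed-width, zero-padded orderIds."""
    if first_order > last_order:
        raise ValueError("first_order must not exceed last_order")
    if first_order < 0 or len(str(last_order)) > width:
        raise ValueError("orderId does not fit the assumed fixed width")
    start = str(first_order).zfill(width).encode("ascii")
    last = str(last_order).zfill(width).encode("ascii")
    # The stop row is exclusive, so bump the final byte of the last prefix:
    # every key that begins with `last` sorts below the result.
    stop = last[:-1] + bytes([last[-1] + 1])
    return start, stop


# Sample keys in the orderId-timeStamp-item layout (illustrative data only).
keys = sorted([
    b"001-20120214-apple",
    b"002-20120101-pear",
    b"003-20120301-fig",
    b"010-20120214-kiwi",
])

start, stop = order_range_bounds(2, 3)
# Simulate what a scan bounded by [start, stop) would return.
selected = [k for k in keys if start <= k < stop]
```

Bumping the last byte rather than adding one to the number keeps the bound 
correct even when the last orderId is the largest value that fits the width.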

Another question: are the time ranges you're scanning a big or small proportion 
of all the rows for each orderId? If you generally expect to return a majority 
of the rows for each order, then a single scan (starting with the lowest 
orderId, and proceeding to the highest) is possibly still a good fit. You can 
also apply timestamp filters, which enable an optimization that excludes 
storefiles that couldn't possibly contain values in that timestamp range; that 
only works if the timestamps on your cells match the timestamp in the row key.

Alternatively, if you expect to return only a small portion of the records 
(i.e. you keep a lot of items with a wide range of timestamps in each orderId, 
but you only want to retrieve a small set of them), you might want to do one 
scan per orderId. You can choose how much parallelism to put into it by 
controlling that yourself (e.g. use a thread per scan on the client side); you 
could theoretically do a thread per orderId, but of course, if you have a very 
large number of them, that many concurrent scans could be harmful.
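
Here is a minimal sketch of the one-scan-per-orderId approach with a capped 
number of client threads. It is plain Python against an in-memory sorted list, 
not HBase client code: scan_rows is a hypothetical stand-in for a real range 
scan, and it assumes the timestamp part of the key is a fixed-width string 
that sorts chronologically (yyyymmdd here).

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a table: row keys held in sorted order (illustrative data).
ROWS = sorted([
    b"001-20120105-apple",
    b"001-20120214-pear",
    b"002-20111230-fig",
    b"002-20120210-kiwi",
    b"003-20120212-plum",
])


def scan_rows(start, stop):
    """Hypothetical range scan: every row with start <= key < stop."""
    return ROWS[bisect_left(ROWS, start):bisect_left(ROWS, stop)]


def scan_orders(order_ids, start_ts, end_ts, max_threads=4):
    """Run one bounded scan per orderId, using at most max_threads threads."""
    def one_order(order_id):
        prefix = order_id.encode("ascii") + b"-"
        # Each scan covers only [orderId-start_ts, orderId-end_ts).
        return scan_rows(prefix + start_ts.encode("ascii"),
                         prefix + end_ts.encode("ascii"))

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # map() yields results in the same order as order_ids.
        return [row for rows in pool.map(one_order, order_ids) for row in rows]


hits = scan_orders(["001", "002", "003"], "20120201", "20120301")
```

The max_threads cap is the knob to tune: the pool queues the remaining 
orderIds instead of starting a thread for each one.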

A regular expression doesn't get you past the fundamental requirement, which is 
that at the server side, it has to look at every row (excepting special 
optimizations like the timestamp one I mentioned above).
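
To make that cost difference concrete, here is a toy comparison in plain 
Python (an in-memory sorted list standing in for the table, with made-up 
keys). The regular expression has to be tested against every row, while a 
start/stop range only touches the rows inside it.

```python
import re
from bisect import bisect_left

# 100 orders x 10 days = 1000 illustrative rows, kept in sorted key order.
ROWS = sorted(
    f"{order:03d}-201202{day:02d}-item".encode("ascii")
    for order in range(1, 101)
    for day in range(1, 11)
)


def regex_scan(pattern):
    """Test the pattern against every row; return (matches, rows examined)."""
    examined = 0
    hits = []
    for key in ROWS:
        examined += 1
        if pattern.match(key):
            hits.append(key)
    return hits, examined


def range_scan(start, stop):
    """Seek to start and read up to stop; return (matches, rows examined)."""
    lo = bisect_left(ROWS, start)
    hi = bisect_left(ROWS, stop)
    return ROWS[lo:hi], hi - lo


regex_hits, regex_examined = regex_scan(re.compile(rb"^042-2012020[1-5]-"))
range_hits, range_examined = range_scan(b"042-20120201", b"042-20120206")
```

Both calls return the same five rows for order 042, but the regex version 
examines all 1000 rows and the range version examines only those five.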

Your best bet is to implement it a couple of ways, with real data, and see 
which one works fastest.

Ian

On Feb 14, 2012, at 11:45 AM, James Young wrote:

Hi there,

I am pretty new to HBase and I am trying to understand the best
practice for doing a scan based on two or more partial scans of the
row key.

For example, I have a row key like: orderId-timeStamp-item. The
orderId has nothing to do with the timeStamp, and I have a requirement to
scan rows for certain orderIds (a range of orderIds) within a certain
time period. I am not sure if it is possible to perform two
partial scans: one for the orderId and another for the timeStamp.

Also, using a regular expression on the row key might work, but it
is more expensive. So I am wondering what the best practice would be
for solving such a problem.


Thanks in advance,

James
