Re: multiple partial scans in the row

Ian Varley Tue, 14 Feb 2012 10:01:46 -0800

James,

Are your orderIds ordered? You say "a range of orderIds", which implies that 
they are (i.e. they're sequential numbers like 001, 002, etc., not hashes or 
random values). If so, then a single scan can hit the rows for multiple 
contiguous orderIds (you'd set the start and stop rows based on a prefix of the 
row key that's just the length of the orderId).
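
To make the start/stop idea concrete, here is a rough sketch in plain Python (no HBase client involved). It assumes zero-padded, fixed-width orderIds and relies on the fact that a scan's start row is inclusive and its stop row is exclusive. The sample keys and the helper names (next_prefix, scan_bounds, range_scan) are made up for illustration, and a sorted in-memory list stands in for the table.

```python
from bisect import bisect_left


def next_prefix(prefix: bytes) -> bytes:
    """Return the smallest byte string that sorts after every key starting with prefix."""
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return b""  # all 0xff (or empty): no upper bound, scan to the end
    return trimmed[:-1] + bytes([trimmed[-1] + 1])


def scan_bounds(first_order_id: str, last_order_id: str):
    """Start row (inclusive) and stop row (exclusive) covering both orderIds."""
    start = first_order_id.encode("ascii")
    stop = next_prefix(last_order_id.encode("ascii"))
    return start, stop


def range_scan(sorted_keys, start, stop):
    """Stand-in for a scan: slice a sorted key list by [start, stop)."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop) if stop else len(sorted_keys)
    return sorted_keys[lo:hi]


# Hypothetical row keys of the form orderId-timeStamp-item.
rows = sorted([
    b"001-20120101-apple",
    b"002-20120105-pear",
    b"002-20120210-plum",
    b"003-20120110-fig",
    b"004-20120111-kiwi",
])

start, stop = scan_bounds("002", "003")
selected = range_scan(rows, start, stop)
```

If the orderIds were not fixed width (say 2 and 10), byte-wise ordering would no longer match numeric ordering and this approach would not work as written.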


Another question: are the time ranges you're scanning a big or small proportion 
of all the rows for each orderId? If you generally expect to return a majority 
of the rows for each order, then a single scan (starting with the lowest 
orderId, and proceeding to the highest) is possibly still a good fit. You can 
also apply timestamp filters (which enable an optimization that excludes 
storefiles that couldn't possibly contain values in that timestamp range), but 
that only works if the timestamps on your cells match the timestamp in the row 
key.
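
As a rough illustration of that trade-off, the sketch below (plain Python, not HBase API) runs one range scan over all the orderIds and keeps only the rows whose key timestamp falls in the window. It assumes fixed-width timestamps such as yyyymmdd so that string comparison matches time order. The data and function names are hypothetical. The point is the gap between rows examined and rows returned: if most rows are returned, the single scan wastes little work.

```python
from bisect import bisect_left


def in_time_window(row_key: bytes, t_min: str, t_max: str) -> bool:
    """Check the timeStamp field of an orderId-timeStamp-item key (bounds inclusive)."""
    _order_id, timestamp, _item = row_key.decode("ascii").split("-", 2)
    return t_min <= timestamp <= t_max


def scan_with_time_filter(sorted_keys, start, stop, t_min, t_max):
    """Walk every key in [start, stop) and keep those inside the time window."""
    lo = bisect_left(sorted_keys, start)
    hi = bisect_left(sorted_keys, stop)
    examined = sorted_keys[lo:hi]
    returned = [key for key in examined if in_time_window(key, t_min, t_max)]
    return len(examined), returned


# Hypothetical sample data.
rows = sorted([
    b"001-20120101-apple",
    b"001-20120214-pear",
    b"002-20120105-plum",
    b"002-20120210-fig",
    b"002-20120301-kiwi",
    b"003-20120212-lime",
    b"004-20120215-date",
])

examined_count, hits = scan_with_time_filter(
    rows, b"001", b"004", "20120201", "20120228"
)
```

Here six rows are examined to return three. With real data that ratio is what decides whether the single scan is good enough.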

Alternatively, if you expect to return only a small portion of the records (i.e. 
you keep a lot of items with a wide range of timestamps in each orderId, but 
you only want to retrieve a small set of them), you might want to do one scan 
per orderId. You can choose how much parallelism to put into it by controlling 
that yourself (i.e. use a thread per scan on the client side); you could 
theoretically do a thread per order id, but of course, if you have a very large 
number of them, that could be harmful.
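
A minimal sketch of the scan-per-orderId approach, again in plain Python with a sorted list standing in for the table. It assumes fixed-width timestamps and ASCII keys, and every name in it (ROWS, scan_one_order, scan_orders, max_workers=4) is hypothetical. The bounded pool is the part that matters: it caps how many scans are in flight no matter how many orderIds you pass in.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sample data; in practice each scan would go to the cluster.
ROWS = sorted([
    b"001-20120101-apple",
    b"001-20120214-pear",
    b"002-20120105-plum",
    b"002-20120210-fig",
    b"002-20120301-kiwi",
    b"003-20120212-lime",
])


def scan_one_order(order_id: str, t_min: str, t_max: str):
    """One narrow scan: only this orderId, only keys inside [t_min, t_max]."""
    start = f"{order_id}-{t_min}".encode("ascii")
    # Keys for t_max look like orderId-t_max-item. Bumping the trailing '-'
    # to the next byte ('.') gives an exclusive stop just past all of them.
    last_prefix = f"{order_id}-{t_max}-".encode("ascii")
    stop = last_prefix[:-1] + bytes([last_prefix[-1] + 1])
    lo = bisect_left(ROWS, start)
    hi = bisect_left(ROWS, stop)
    return ROWS[lo:hi]


def scan_orders(order_ids, t_min, t_max, max_workers=4):
    """Run one scan per orderId on a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda oid: scan_one_order(oid, t_min, t_max), order_ids)
        return dict(zip(order_ids, results))


by_order = scan_orders(["001", "002", "003"], "20120201", "20120228")
```

Each scan here touches only the rows it returns, which is the payoff when the time window covers a small slice of each order.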

A regular expression doesn't get you past the fundamental requirement, which is 
that at the server side, it has to look at every row (excepting special 
optimizations like the timestamp one I mentioned above).
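
To see why, here is a toy version in plain Python (not the HBase filter API): the pattern has to be tested against each key in turn, so the number of rows examined equals the number of rows in the scanned range, however few match. The data, the pattern, and the regex_scan name are made up for illustration.

```python
import re


def regex_scan(sorted_keys, pattern: bytes):
    """Test every key against the pattern; return (rows examined, matches)."""
    compiled = re.compile(pattern)
    examined = 0
    matches = []
    for key in sorted_keys:
        examined += 1  # each row costs a comparison, match or not
        if compiled.fullmatch(key):
            matches.append(key)
    return examined, matches


# Hypothetical sample data.
rows = sorted([
    b"001-20120101-apple",
    b"001-20120214-pear",
    b"002-20120105-plum",
    b"002-20120210-fig",
    b"002-20120301-kiwi",
    b"003-20120212-lime",
    b"004-20120215-date",
])

# orderIds 001-003, any day in February 2012, any item.
examined_count, matched = regex_scan(rows, rb"00[1-3]-201202\d\d-.*")
```

Narrowing the start and stop rows first shrinks the set the pattern runs over, but within that range every row still gets looked at.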

Your best bet is to implement it a couple ways, with real data, and see which 
ones seem to work the fastest.

Ian

On Feb 14, 2012, at 11:45 AM, James Young wrote:

Hi there,

I am pretty new to HBase and I am trying to understand the best
practice for doing a scan based on two or more partial scans of the
row key.

For example, I have a row key like: orderId-timeStamp-item. The
orderId has nothing to do with the timeStamp, and I have a requirement
to scan rows for certain orderIds (a range of orderIds) within a
certain time period. I am not sure if it is possible to perform two
partial scans: one for the orderId and another for the timeStamp.

Also, using a regular expression on the row key might work, but it
is more expensive. So I am wondering what the best practice would be
for solving such a problem.


Thanks in advance,

James
