Thank you, Ian! Yes, the orderIds are ordered. I might try the timestamp filter, but it still doesn't provide an early-out feature, and I'm not sure how it would perform. Do you think it might be worth writing a custom filter to do the two partial scans?
Thanks again.

James

On Wed, Feb 15, 2012 at 2:01 AM, Ian Varley <[email protected]> wrote:
> James,
>
> Are your orderIds ordered? You say "a range of orderIds", which implies that
> (i.e. they're sequential numbers like 001, 002, etc, not hashes or random
> values). If so, then a single scan can hit the rows for multiple contiguous
> orderIds (you'd set the start and stop rows based on a prefix of the row key
> that's just the length of the orderid).
>
> Another question: are the time ranges you're scanning a big or small
> proportion of all the rows for each order id? If you generally expect to
> return a majority of the rows per each order, then a single scan (starting
> with the lowest orderId, and proceeding to the highest) is possibly still a
> good fit. You can also apply timestamp filters (which enables an optimization
> to exclude storefiles that couldn't possibly contain values in that timestamp
> range); that only works if the timestamps on your cells match the timestamp
> in the row key.
>
> Alternately, if you expect to return only a small portion of the records
> (i.e. you keep a lot of items with a wide range of timestamps in each
> orderId, but you only want to retrieve a small set of them), you might want
> to do one scan per orderId. You can choose how much parallelism to put into
> it by controlling that yourself (i.e. use a thread per scan on the client
> side); you could theoretically do a thread per order id, but of course, if
> you have a very large number of them, that could be harmful.
>
> A regular expression doesn't get you past the fundamental requirement, which
> is that at the server side, it has to look at every row (excepting special
> optimizations like the timestamp one I mentioned above).
>
> Your best bet is to implement it a couple ways, with real data, and see which
> ones seem to work the fastest.
>
> Ian
>
> On Feb 14, 2012, at 11:45 AM, James Young wrote:
>
> Hi there,
>
> I am pretty new to HBase and i am trying to understand the best
> practice to do the scan based on two/multiple partial scans for the
> row key.
>
> For example, I have a row key like: orderId-timeStamp-item. The
> orderId has nothing to with the timeStamp and i have a requirement to
> scan rows for certain orderIds ( a range of orderIds) within certain
> time period. I am not sure if it is possible to perform two
> partial scan: one is for orderId and another one is for the timeStamp.
>
> Also, doing regular expression on the row key might work out. But it
> is more expensive. so I am wondering what would be the best practice
> for solving such a problem.
>
>
> Thanks in advance,
>
> James
>
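
To make the "two partial scans with an early out" idea from this thread concrete, here is a rough sketch in plain Python. It does not use the HBase client at all: a sorted list of strings stands in for the table, and `bisect_left` stands in for a server-side seek. The function name `skip_scan`, the sample keys, and the fixed-width orderId and timeStamp fields are all assumptions made for illustration. The scan starts at the first orderId (the start row), stops once it passes the last one (the stop row), and inside each orderId jumps forward instead of reading every row: up to the start of the time window, then on to the next orderId once the window has been passed.

```python
from bisect import bisect_left


def skip_scan(sorted_keys, first_order, last_order, t_min, t_max):
    """Return keys of the form orderId-timeStamp-item whose orderId lies in
    [first_order, last_order] and whose timeStamp lies in [t_min, t_max].

    Assumes sorted_keys is sorted lexicographically and that orderIds and
    timeStamps are fixed-width, so string order matches numeric order.
    """
    hits = []
    # "Start row": first key at or after the lowest orderId prefix.
    i = bisect_left(sorted_keys, first_order + "-")
    while i < len(sorted_keys):
        key = sorted_keys[i]
        order_id, ts, _item = key.split("-", 2)
        if order_id > last_order:
            break  # "Stop row": past the last orderId of interest.
        if ts < t_min:
            # Too early: seek to the first row of this order at or after t_min.
            i = bisect_left(sorted_keys, order_id + "-" + t_min, i)
        elif ts > t_max:
            # Early out: nothing later in this order can match, so seek past
            # it. "." is the character right after "-" in ASCII, so
            # order_id + "." sorts just after every key with this prefix.
            i = bisect_left(sorted_keys, order_id + ".", i)
        else:
            hits.append(key)
            i += 1
    return hits


# Hypothetical sample data, already in row-key order.
rows = sorted([
    "001-20120101-pen",
    "001-20120214-ink",
    "001-20120301-pad",
    "002-20120210-cup",
    "002-20120215-mug",
    "003-20120214-box",
    "004-20120214-bag",
])

for key in skip_scan(rows, "001", "003", "20120210", "20120220"):
    print(key)
```

In HBase itself the jump would have to come from a custom filter that hands the scanner a seek hint rather than from the client, and whether that beats a plain range scan or one scan per orderId depends on how sparse the matching rows are, so it is still worth measuring against real data as suggested above.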
