Re: setTimeRange and setMaxVersions seem to be inefficient

lars hofhansl Mon, 27 Aug 2012 21:55:22 -0700

First off, regarding "inefficiency": if version counting happened first and 
the filter were executed afterwards, we'd have folks "complaining" about 
inefficiencies as well:
("Why does the code have to go through the versioning stuff when my filter 
filters the row/column/version anyway?")  ;-)

For your problem, you want to make use of "seek hints"...

In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even 
SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).

That way the scanning framework will know to skip ahead to the next column, 
row, or a KV of your choosing. (see Filter.filterKeyValue and 
Filter.getNextKeyHint).

(as an aside, it would probably be nice if Filters also had 
INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by StoreScanner)
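
To illustrate how a filter can use these return codes to stop walking through 
the versions of a column, here is a minimal sketch in plain Java. It does not 
use the real HBase classes: ReturnCode, Cell, VersionLimitFilter, sampleRow and 
the scan loop below are simplified, hypothetical stand-ins, and the real 
scanner seeks rather than iterating over an in-memory list. Because NEXT_COL 
itself does not include the current KV, the sketch returns INCLUDE for the 
match and NEXT_COL on the following KV of the same column.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, self-contained model; none of these types are the real HBase classes.
class SeekHintSketch {

    // Stand-in for a subset of the return codes a filter can hand back.
    enum ReturnCode { INCLUDE, SKIP, NEXT_COL }

    // Hypothetical stand-in for a KeyValue: a column qualifier and a timestamp.
    static final class Cell {
        final String qualifier;
        final long timestamp;

        Cell(String qualifier, long timestamp) {
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // Keeps its own per-column count and asks for the next column once the limit is hit.
    static final class VersionLimitFilter {
        private final int maxVersions;
        private final long maxTimestamp;
        private String currentQualifier = null;
        private int included = 0;

        VersionLimitFilter(int maxVersions, long maxTimestamp) {
            this.maxVersions = maxVersions;
            this.maxTimestamp = maxTimestamp;
        }

        ReturnCode filterKeyValue(Cell cell) {
            if (!cell.qualifier.equals(currentQualifier)) {
                // A new column started: reset the per-column counter.
                currentQualifier = cell.qualifier;
                included = 0;
            }
            if (included >= maxVersions) {
                return ReturnCode.NEXT_COL; // enough versions, skip the rest of this column
            }
            if (cell.timestamp > maxTimestamp) {
                return ReturnCode.SKIP;     // too new, look at the next (older) version
            }
            included++;
            return ReturnCode.INCLUDE;
        }
    }

    // Toy scan loop that honours the return codes; returns the number of filter calls.
    // Cells are expected sorted by qualifier, then by timestamp descending.
    static int scan(List<Cell> cells, VersionLimitFilter filter, List<Cell> out) {
        int filterCalls = 0;
        int i = 0;
        while (i < cells.size()) {
            Cell cell = cells.get(i);
            filterCalls++;
            ReturnCode rc = filter.filterKeyValue(cell);
            if (rc == ReturnCode.INCLUDE) {
                out.add(cell);
            }
            if (rc == ReturnCode.NEXT_COL) {
                // Jump past all remaining versions of this column.
                while (i < cells.size() && cells.get(i).qualifier.equals(cell.qualifier)) {
                    i++;
                }
            } else {
                i++;
            }
        }
        return filterCalls;
    }

    // Builds one row with the given number of columns, newest version first.
    static List<Cell> sampleRow(int columns, int versionsPerColumn) {
        List<Cell> cells = new ArrayList<>();
        for (int c = 0; c < columns; c++) {
            for (long ts = versionsPerColumn; ts >= 1; ts--) {
                cells.add(new Cell("col" + c, ts));
            }
        }
        return cells;
    }

    public static void main(String[] args) {
        List<Cell> row = sampleRow(3, 10000);
        List<Cell> results = new ArrayList<>();
        int calls = scan(row, new VersionLimitFilter(1, 10000L), results);
        System.out.println("returned=" + results.size() + " filterCalls=" + calls);
    }
}
```

In this toy model, a row with 3 columns of 10,000 versions each costs two 
filter calls per column (one INCLUDE, one NEXT_COL) instead of 10,000. The 
extra call per column is the price of not having a combined 
"include and go to next column" code available to filters.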

Have a look at ColumnPrefixFilter as an example.
I also wrote a short post here: 
http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html

Does that help?

-- Lars

----- Original Message -----
From: Jerry Lam <chiling...@gmail.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Cc: "user@hbase.apache.org" <user@hbase.apache.org>
Sent: Monday, August 27, 2012 5:59 PM
Subject: Re: setTimeRange and setMaxVersions seem to be inefficient

Hi Lars:

Thanks for confirming the inefficiency of the implementation for this case. In 
my case, a column can have more than 10K versions, so I need a quick way to stop 
the scan from digging through the column once there is a match 
(ReturnCode.INCLUDE). It would be nice to have a ReturnCode that tells the 
framework to stop and go to the next column once the number of versions 
specified in setMaxVersions is met.

For now, I guess I have to hack it in the custom filter (i.e. I keep the count 
myself)? If you have a better way to achieve this, please share :)

Best Regards,

Jerry

Sent from my iPad (sorry for spelling mistakes)

On 2012-08-27, at 20:11, lars hofhansl <lhofha...@yahoo.com> wrote:

> Currently filters are evaluated before we do version counting.
> 
> Here's a comment from ScanQueryMatcher.java:
>     /**
>      * Filters should be checked before checking column trackers. If we do
>      * otherwise, as was previously being done, ColumnTracker may increment its
>      * counter for even that KV which may be discarded later on by Filter. This
>      * would lead to incorrect results in certain cases.
>      */
> 
> 
> So this is by design. (Doesn't mean it's correct or desirable, though.)
> 
> -- Lars
> 
> 
> ----- Original Message -----
> From: Jerry Lam <chiling...@gmail.com>
> To: user <user@hbase.apache.org>
> Cc: 
> Sent: Monday, August 27, 2012 2:40 PM
> Subject: setTimeRange and setMaxVersions seem to be inefficient
> 
> Hi HBase community:
> 
> I tried to use setTimeRange and setMaxVersions to limit the number of KVs
> returned per column. The behaviour is as I would expect, that is,
> setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version of KV
> with a timestamp that is less than or equal to T.
> However, I noticed that all versions of the KeyValue for a particular
> column are processed through a custom filter I implemented, even though I
> specify setMaxVersions(1) and setTimeRange(0, T + 1). I expected that if ONE
> KV of a particular column has ReturnCode.INCLUDE, the framework would jump
> to the next COL instead of iterating through all versions of the column.
> 
> Can someone confirm whether this is the expected behaviour (iterating through
> all versions of a column before setMaxVersions takes effect)? If this is the
> expected behaviour, what is your recommendation to speed this up?
> 
> Best Regards,
> 
> Jerry
>
