First off, regarding "inefficiency"... If version counting happened first and the filter ran afterwards, we'd have folks "complaining" about inefficiencies as well ("Why does the code have to go through the versioning stuff when my filter filters the row/column/version anyway?") ;-)
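To make the trade-off between the two orderings concrete, here is a minimal sketch for a single column. It is a toy model, not HBase code: FilterOrderSketch, filterThenCount, countThenFilter and the timestamp predicate are all hypothetical names, and a "version" is just a timestamp, listed newest first. With a limit of one version and a filter that rejects the newest one, counting first spends the version budget on a KV that the filter then discards, so nothing comes back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Toy model only: none of these types or methods are HBase classes. */
public class FilterOrderSketch {

    /** Filter first: only versions the filter accepts count against maxVersions. */
    public static List<Long> filterThenCount(List<Long> versions, Predicate<Long> filter, int maxVersions) {
        List<Long> out = new ArrayList<>();
        for (long ts : versions) {
            if (out.size() == maxVersions) {
                break;                      // enough accepted versions for this column
            }
            if (filter.test(ts)) {
                out.add(ts);                // rejected versions never touch the counter
            }
        }
        return out;
    }

    /** Count first: every version seen counts, even if the filter rejects it afterwards. */
    public static List<Long> countThenFilter(List<Long> versions, Predicate<Long> filter, int maxVersions) {
        List<Long> out = new ArrayList<>();
        int counted = 0;
        for (long ts : versions) {
            if (counted == maxVersions) {
                break;                      // version budget already used up
            }
            counted++;                      // counter advances before the filter has a say
            if (filter.test(ts)) {
                out.add(ts);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Long> versions = Arrays.asList(30L, 20L, 10L); // newest first
        Predicate<Long> olderThan25 = ts -> ts < 25;        // hypothetical custom filter
        System.out.println("filter, then count: " + filterThenCount(versions, olderThan25, 1));
        System.out.println("count, then filter: " + countThenFilter(versions, olderThan25, 1));
    }
}
```

In this model the filter-first ordering returns the version with timestamp 20, while the count-first ordering returns nothing, which is the kind of incorrect result the ScanQueryMatcher comment quoted further down warns about.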
For your problem, you want to make use of "seek hints"...

In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
That way the scanning framework will know to skip ahead to the next column, row, or a KV of your choosing (see Filter.filterKeyValue and Filter.getNextKeyHint).

(As an aside, it would probably be nice if Filters also had INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by StoreScanner.)

Have a look at ColumnPrefixFilter as an example.
I also wrote a short post here: http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html

Does that help?

-- Lars


----- Original Message -----
From: Jerry Lam <chiling...@gmail.com>
To: "user@hbase.apache.org" <user@hbase.apache.org>
Cc: "user@hbase.apache.org" <user@hbase.apache.org>
Sent: Monday, August 27, 2012 5:59 PM
Subject: Re: setTimeRange and setMaxVersions seem to be inefficient

Hi Lars:

Thanks for confirming the inefficiency of the implementation for this case. In my case, a column can have more than 10K versions, so I need a quick way to stop the scan from digging through the column once there is a match (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that can notify the framework to stop and go to the next column once the number of versions specified in setMaxVersions is met.

For now, I guess I have to hack it in the custom filter (i.e. I keep the count myself)? If you have a better way to achieve this, please share :)

Best Regards,

Jerry

Sent from my iPad (sorry for spelling mistakes)

On 2012-08-27, at 20:11, lars hofhansl <lhofha...@yahoo.com> wrote:

> Currently filters are evaluated before we do version counting.
>
> Here's a comment from ScanQueryMatcher.java:
>     /**
>      * Filters should be checked before checking column trackers. If we do
>      * otherwise, as was previously being done, ColumnTracker may increment its
>      * counter for even that KV which may be discarded later on by Filter. This
>      * would lead to incorrect results in certain cases.
>      */
>
> So this is by design. (Doesn't mean it's correct or desirable, though.)
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <chiling...@gmail.com>
> To: user <user@hbase.apache.org>
> Cc:
> Sent: Monday, August 27, 2012 2:40 PM
> Subject: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi HBase community:
>
> I tried to use setTimeRange and setMaxVersions to limit the number of KVs
> returned per column. The behaviour is as I would expect, that is,
> setTimeRange(0, T + 1) and setMaxVersions(1) will give me ONE version of KV
> with a timestamp that is less than or equal to T.
> However, I noticed that all versions of the KeyValue for a particular
> column are processed through a custom filter I implemented even though I
> specify setMaxVersions(1) and setTimeRange(0, T + 1). I expected that if ONE
> KV of a particular column has ReturnCode.INCLUDE, the framework will jump
> to the next COL instead of iterating through all versions of the column.
>
> Can someone confirm whether this is the expected behaviour (iterating through
> all versions of a column before setMaxVersions takes effect)? If this is the
> expected behaviour, what is your recommendation to speed this up?
>
> Best Regards,
>
> Jerry