Hi Ted:

Sure, will do. I will also implement the reset() method to set
previousIncludedQualifier to null before the next row comes in.
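Something along these lines (just a rough sketch, not tested yet; the seek
target in getNextKeyHint is only illustrative and reuses the this.qualifier
field from my filter quoted below):

@Override
public void reset() {
  // called before each new row, so the per-row tracking starts fresh
  previousIncludedQualifier = null;
}

@Override
public KeyValue getNextKeyHint(KeyValue currentKV) {
  // jump within the current row to the first KeyValue of the qualifier
  // this filter cares about (illustrative target only)
  return KeyValue.createFirstOnRow(currentKV.getRow(), currentKV.getFamily(),
      this.qualifier);
}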
Best Regards,

Jerry

On Wed, Aug 29, 2012 at 1:47 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Jerry:
> Remember to also implement:
>
> + @Override
> + public KeyValue getNextKeyHint(KeyValue currentKV) {
>
> You can log a JIRA for supporting ReturnCode.INCLUDE_AND_NEXT_COL.
>
> Cheers
>
> On Wed, Aug 29, 2012 at 6:59 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
> > Hi Lars:
> >
> > Thanks for spending time discussing this with me. I appreciate it.
> >
> > I tried to implement the setMaxVersions(1) inside the filter as follows:
> >
> > @Override
> > public ReturnCode filterKeyValue(KeyValue kv) {
> >
> >   // check if the same qualifier as the one that has been included
> >   // previously. If yes, jump to next column
> >   if (previousIncludedQualifier != null &&
> >       Bytes.compareTo(previousIncludedQualifier, kv.getQualifier()) == 0) {
> >     previousIncludedQualifier = null;
> >     return ReturnCode.NEXT_COL;
> >   }
> >   // another condition that makes the jump further using HINT
> >   if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
> >     LOG.info("Matched Found.");
> >     return ReturnCode.SEEK_NEXT_USING_HINT;
> >   }
> >   // include this to the result and keep track of the included
> >   // qualifier so the next version of the same qualifier will be excluded
> >   previousIncludedQualifier = kv.getQualifier();
> >   return ReturnCode.INCLUDE;
> > }
> >
> > Does this look reasonable or there is a better way to achieve this? It
> > would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case though.
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> > On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl <lhofha...@yahoo.com> wrote:
> >
> > > Hi Jerry,
> > >
> > > my answer will be the same again:
> > > Some folks will want the max versions set by the client to be before
> > > filters and some folks will want it to restrict the end result.
> > > It's not possible to have it both ways. Your filter needs to do the
> > > right thing.
> > >
> > > There's a lot of discussion around this in HBASE-5104.
> > >
> > > -- Lars
> > >
> > >
> > > ________________________________
> > > From: Jerry Lam <chiling...@gmail.com>
> > > To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> > > Sent: Tuesday, August 28, 2012 1:52 PM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > I see. Please refer to the inline comment below.
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> > > On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lhofha...@yahoo.com> wrote:
> > >
> > > > What I was saying was: It depends. :)
> > > >
> > > > First off, how do you get to 1000 versions? In 0.94++ older versions
> > > > are pruned upon flush, so you need 333 flushes (assuming 3 versions
> > > > on the CF) to get 1000 versions.
> > > >
> > >
> > > I forgot that the default number of versions to keep is 3. If this is
> > > what people use most of the time, yes you are right for this type of
> > > scenario where the number of versions per column to keep is small.
> > >
> > > > By that time some compactions will have happened and you're back to
> > > > close to 3 versions (maybe 9, 12, or 15 or so, depending on how many
> > > > store files you have).
> > > >
> > > > Now, if you have that many versions because you set VERSIONS => 1000
> > > > in your CF... Then imagine you have 100 columns with 1000 versions
> > > > each.
> > > >
> > >
> > > Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> > > versioning myself)
> > >
> > > > In your scenario below you'd do 100000 comparisons if the filter
> > > > would be evaluated after the version counting. But only 1100 with
> > > > the current code. (or at least in that ball park)
> > > >
> > >
> > > This is where I don't quite understand what you mean.
> > >
> > > If the framework counts the number of ReturnCode.INCLUDE and then stops
> > > feeding the KeyValue into the filterKeyValue method after it reaches
> > > the count specified in setMaxVersions (i.e. 1 for the case we
> > > discussed), should then be just 100 comparisons only (at most) instead
> > > of 1100 comparisons? Maybe I don't understand how the current way is
> > > doing...
> > >
> > > > The gist is: One can construct scenarios where one approach is better
> > > > than the other. Only one order is possible.
> > > > If you write a custom filter and you care about these things you
> > > > should use the seek hints.
> > > >
> > > > -- Lars
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Jerry Lam <chiling...@gmail.com>
> > > > To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> > > > Cc:
> > > > Sent: Tuesday, August 28, 2012 7:17 AM
> > > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > > >
> > > > Hi Lars:
> > > >
> > > > Thanks for the reply.
> > > > I need to understand if I misunderstood the perceived inefficiency
> > > > because it seems you don't think quite the same.
> > > >
> > > > Let say, as an example, we have 1 row with 2 columns (col-1 and
> > > > col-2) in a table and each column has 1000 versions. Using the
> > > > following code (the code might have errors and don't compile):
> > > >
> > > > /**
> > > >  * This is very simple use case of a ColumnPrefixFilter.
> > > >  * In fact all other filters that make use of filterKeyValue will
> > > >  * see similar performance problems that I have concerned with when
> > > >  * the number of versions per column could be huge.
> > > >
> > > > Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> > > > Scan scan = new Scan();
> > > > scan.setFilter(filter);
> > > > ResultScanner scanner = table.getScanner(scan);
> > > > for (Result result : scanner) {
> > > >   for (KeyValue kv : result.raw()) {
> > > >     System.out.println("KV: " + kv + ", Value: " +
> > > >         Bytes.toString(kv.getValue()));
> > > >   }
> > > > }
> > > > scanner.close();
> > > > */
> > > >
> > > > Implicitly, the number of version per column that is going to return
> > > > is 1 (the latest version). User might expect that only 2 comparisons
> > > > for column prefix are needed (1 for col-1 and 1 for col-2) but in
> > > > fact, it processes the filterKeyValue method in ColumnPrefixFilter
> > > > 1000 times (1 for col-1 and 1000 for col-2) for col-2 (1 per version)
> > > > because all versions of the column have the same prefix for obvious
> > > > reason. For col-1, it will skip using SEEK_NEXT_USING_HINT which
> > > > should skip the 99 versions of col-1.
> > > >
> > > > In summary, the 1000 comparisons (5000 byte comparisons) for the
> > > > column prefix "col-2" is wasted because only 1 version is returned
> > > > to user. Also, I believe this inefficiency is hidden from the user
> > > > code but it affects all filters that use filterKeyValue as the main
> > > > execution for filtering KVs. Do we have a case to improve HBase to
> > > > handle this inefficiency? :) It seems valid unless you prove
> > > > otherwise.
> > > >
> > > > Best Regards,
> > > >
> > > > Jerry
> > > >
> > > >
> > > > On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lhofha...@yahoo.com>
> > > > wrote:
> > > >
> > > > > First off regarding "inefficiency"... If version counting would
> > > > > happen first and then filter were executed we'd have folks
> > > > > "complaining" about inefficiencies as well:
> > > > > ("Why does the code have to go through the versioning stuff when
> > > > > my filter filters the row/column/version anyway?") ;-)
> > > > >
> > > > > For your problem, you want to make use of "seek hints"...
> > > > >
> > > > > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > > > > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> > > > >
> > > > > That way the scanning framework will know to skip ahead to the
> > > > > next column, row, or a KV of your choosing. (see
> > > > > Filter.filterKeyValue and Filter.getNextKeyHint).
> > > > >
> > > > > (as an aside, it would probably be nice if Filters also had
> > > > > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> > > > > StoreScanner)
> > > > >
> > > > > Have a look at ColumnPrefixFilter as an example.
> > > > > I also wrote a short post here:
> > > > > http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> > > > >
> > > > > Does that help?
> > > > >
> > > > > -- Lars
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Jerry Lam <chiling...@gmail.com>
> > > > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > > Cc: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > > Sent: Monday, August 27, 2012 5:59 PM
> > > > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > > > >
> > > > > Hi Lars:
> > > > >
> > > > > Thanks for confirming the inefficiency of the implementation for
> > > > > this case. For my case, a column can have more than 10K versions,
> > > > > I need a quick way to stop the scan from digging the column once
> > > > > there is a match (ReturnCode.INCLUDE). It would be nice to have a
> > > > > ReturnCode that can notify the framework to stop and go to next
> > > > > column once the number of versions specify in setMaxVersions is
> > > > > met.
> > > > >
> > > > > For now, I guess I have to hack it in the custom filter (I.e. I
> > > > > keep the count myself)? If you have a better way to achieve this,
> > > > > please share :)
> > > > >
> > > > > Best Regards,
> > > > >
> > > > > Jerry
> > > > >
> > > > > Sent from my iPad (sorry for spelling mistakes)
> > > > >
> > > > > On 2012-08-27, at 20:11, lars hofhansl <lhofha...@yahoo.com> wrote:
> > > > >
> > > > > > Currently filters are evaluated before we do version counting.
> > > > > >
> > > > > > Here's a comment from ScanQueryMatcher.java:
> > > > > >   /**
> > > > > >    * Filters should be checked before checking column trackers.
> > > > > >    * If we do otherwise, as was previously being done,
> > > > > >    * ColumnTracker may increment its counter for even that KV
> > > > > >    * which may be discarded later on by Filter. This would lead
> > > > > >    * to incorrect results in certain cases.
> > > > > >    */
> > > > > >
> > > > > > So this is by design. (Doesn't mean it's correct or desirable,
> > > > > > though.)
> > > > > >
> > > > > > -- Lars
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Jerry Lam <chiling...@gmail.com>
> > > > > > To: user <user@hbase.apache.org>
> > > > > > Cc:
> > > > > > Sent: Monday, August 27, 2012 2:40 PM
> > > > > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > > > > >
> > > > > > Hi HBase community:
> > > > > >
> > > > > > I tried to use setTimeRange and setMaxVersions to limit the
> > > > > > number of KVs return per column. The behaviour is as I would
> > > > > > expect, that is, setTimeRange(0, T + 1) and setMaxVersions(1)
> > > > > > will give me ONE version of KV with timestamp that is less than
> > > > > > or equal to T.
> > > > > > However, I noticed that all versions of the KeyValue for a
> > > > > > particular column are processed through a custom filter I
> > > > > > implemented even though I specify setMaxVersions(1) and
> > > > > > setTimeRange(0, T+1). I expected that if ONE KV of a particular
> > > > > > column has ReturnCode.INCLUDE, the framework will jump to the
> > > > > > next COL instead of iterating through all versions of the
> > > > > > column.
> > > > > >
> > > > > > Can someone confirm me if this is the expected behaviour
> > > > > > (iterating through all versions of a column before
> > > > > > setMaxVersions take effect)? If this is an expected behaviour,
> > > > > > what is your recommendation to speed this up?
> > > > > >
> > > > > > Best Regards,
> > > > > >
> > > > > > Jerry