Hi Ted:

Sure, will do. I will also implement the reset() method to set
previousIncludedQualifier to null before the next row comes in.
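Something along these lines (just a rough sketch, not tested yet; the seek
target in getNextKeyHint is only illustrative and reuses the this.qualifier
field from my filter quoted below):

@Override
public void reset() {
  // called before each new row, so the per-row tracking starts fresh
  previousIncludedQualifier = null;
}

@Override
public KeyValue getNextKeyHint(KeyValue currentKV) {
  // jump within the current row to the first KeyValue of the qualifier
  // this filter cares about (illustrative target only)
  return KeyValue.createFirstOnRow(currentKV.getRow(), currentKV.getFamily(),
      this.qualifier);
}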
Best Regards,

Jerry

On Wed, Aug 29, 2012 at 1:47 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Jerry:
> Remember to also implement:
>
> + @Override
> + public KeyValue getNextKeyHint(KeyValue currentKV) {
>
> You can log a JIRA for supporting ReturnCode.INCLUDE_AND_NEXT_COL.
>
> Cheers
>
> On Wed, Aug 29, 2012 at 6:59 AM, Jerry Lam <chiling...@gmail.com> wrote:
>
> > Hi Lars:
> >
> > Thanks for spending time discussing this with me. I appreciate it.
> >
> > I tried to implement the setMaxVersions(1) inside the filter as follows:
> >
> > @Override
> > public ReturnCode filterKeyValue(KeyValue kv) {
> >
> >   // check if the same qualifier as the one that has been included
> >   // previously. If yes, jump to next column
> >   if (previousIncludedQualifier != null &&
> >       Bytes.compareTo(previousIncludedQualifier, kv.getQualifier()) == 0) {
> >     previousIncludedQualifier = null;
> >     return ReturnCode.NEXT_COL;
> >   }
> >   // another condition that makes the jump further using HINT
> >   if (Bytes.compareTo(this.qualifier, kv.getQualifier()) == 0) {
> >     LOG.info("Matched Found.");
> >     return ReturnCode.SEEK_NEXT_USING_HINT;
> >   }
> >   // include this to the result and keep track of the included
> >   // qualifier so the next version of the same qualifier will be excluded
> >   previousIncludedQualifier = kv.getQualifier();
> >   return ReturnCode.INCLUDE;
> > }
> >
> > Does this look reasonable or there is a better way to achieve this? It
> > would be nice to have ReturnCode.INCLUDE_AND_NEXT_COL for this case though.
> >
> > Best Regards,
> >
> > Jerry
> >
> >
> > On Wed, Aug 29, 2012 at 2:09 AM, lars hofhansl <lhofha...@yahoo.com> wrote:
> >
> > > Hi Jerry,
> > >
> > > my answer will be the same again:
> > > Some folks will want the max versions set by the client to be before
> > > filters and some folks will want it to restrict the end result.
> > > It's not possible to have it both ways. Your filter needs to do the
> > > right thing.
> > >
> > > There's a lot of discussion around this in HBASE-5104.
> > >
> > > -- Lars
> > >
> > >
> > > ________________________________
> > > From: Jerry Lam <chiling...@gmail.com>
> > > To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> > > Sent: Tuesday, August 28, 2012 1:52 PM
> > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > >
> > > Hi Lars:
> > >
> > > I see. Please refer to the inline comment below.
> > >
> > > Best Regards,
> > >
> > > Jerry
> > >
> > > On Tue, Aug 28, 2012 at 2:21 PM, lars hofhansl <lhofha...@yahoo.com> wrote:
> > >
> > > > What I was saying was: It depends. :)
> > > >
> > > > First off, how do you get to 1000 versions? In 0.94++ older versions
> > > > are pruned upon flush, so you need 333 flushes (assuming 3 versions
> > > > on the CF) to get 1000 versions.
> > > >
> > >
> > > I forgot that the default number of versions to keep is 3. If this is
> > > what people use most of the time, yes you are right for this type of
> > > scenario where the number of versions per column to keep is small.
> > >
> > > > By that time some compactions will have happened and you're back to
> > > > close to 3 versions (maybe 9, 12, or 15 or so, depending on how many
> > > > store files you have).
> > > >
> > > > Now, if you have that many versions because you set VERSIONS => 1000
> > > > in your CF... Then imagine you have 100 columns with 1000 versions
> > > > each.
> > > >
> > >
> > > Yes, imagine I set VERSIONS => Long.MAX_VALUE (i.e. I will manage the
> > > versioning myself)
> > >
> > > > In your scenario below you'd do 100000 comparisons if the filter
> > > > would be evaluated after the version counting. But only 1100 with
> > > > the current code. (or at least in that ball park)
> > > >
> > >
> > > This is where I don't quite understand what you mean.
> > >
> > > If the framework counts the number of ReturnCode.INCLUDE and then stops
> > > feeding the KeyValue into the filterKeyValue method after it reaches
> > > the count specified in setMaxVersions (i.e. 1 for the case we
> > > discussed), should then be just 100 comparisons only (at most) instead
> > > of 1100 comparisons? Maybe I don't understand how the current way is
> > > doing...
> > >
> > > > The gist is: One can construct scenarios where one approach is better
> > > > than the other. Only one order is possible.
> > > > If you write a custom filter and you care about these things you
> > > > should use the seek hints.
> > > >
> > > > -- Lars
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Jerry Lam <chiling...@gmail.com>
> > > > To: user@hbase.apache.org; lars hofhansl <lhofha...@yahoo.com>
> > > > Cc:
> > > > Sent: Tuesday, August 28, 2012 7:17 AM
> > > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > > >
> > > > Hi Lars:
> > > >
> > > > Thanks for the reply.
> > > > I need to understand if I misunderstood the perceived inefficiency
> > > > because it seems you don't think quite the same.
> > > >
> > > > Let say, as an example, we have 1 row with 2 columns (col-1 and
> > > > col-2) in a table and each column has 1000 versions. Using the
> > > > following code (the code might have errors and don't compile):
> > > >
> > > > /**
> > > >  * This is very simple use case of a ColumnPrefixFilter.
> > > >  * In fact all other filters that make use of filterKeyValue will
> > > >  * see similar performance problems that I have concerned with when
> > > >  * the number of versions per column could be huge.
> > > >
> > > > Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
> > > > Scan scan = new Scan();
> > > > scan.setFilter(filter);
> > > > ResultScanner scanner = table.getScanner(scan);
> > > > for (Result result : scanner) {
> > > >   for (KeyValue kv : result.raw()) {
> > > >     System.out.println("KV: " + kv + ", Value: " +
> > > >         Bytes.toString(kv.getValue()));
> > > >   }
> > > > }
> > > > scanner.close();
> > > > */
> > > >
> > > > Implicitly, the number of version per column that is going to return
> > > > is 1 (the latest version). User might expect that only 2 comparisons
> > > > for column prefix are needed (1 for col-1 and 1 for col-2) but in
> > > > fact, it processes the filterKeyValue method in ColumnPrefixFilter
> > > > 1000 times (1 for col-1 and 1000 for col-2) for col-2 (1 per version)
> > > > because all versions of the column have the same prefix for obvious
> > > > reason. For col-1, it will skip using SEEK_NEXT_USING_HINT which
> > > > should skip the 99 versions of col-1.
> > > >
> > > > In summary, the 1000 comparisons (5000 byte comparisons) for the
> > > > column prefix "col-2" is wasted because only 1 version is returned
> > > > to user. Also, I believe this inefficiency is hidden from the user
> > > > code but it affects all filters that use filterKeyValue as the main
> > > > execution for filtering KVs. Do we have a case to improve HBase to
> > > > handle this inefficiency? :) It seems valid unless you prove
> > > > otherwise.
> > > >
> > > > Best Regards,
> > > >
> > > > Jerry
> > > >
> > > >
> > > > On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lhofha...@yahoo.com>
> > > > wrote:
> > > >
> > > > > First off regarding "inefficiency"... If version counting would
> > > > > happen first and then filter were executed we'd have folks
> > > > > "complaining" about inefficiencies as well:
> > > > > ("Why does the code have to go through the versioning stuff when
> > > > > my filter filters the row/column/version anyway?") ;-)
> > > > >
> > > > > For your problem, you want to make use of "seek hints"...
> > > > >
> > > > > In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> > > > > SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
> > > > >
> > > > > That way the scanning framework will know to skip ahead to the
> > > > > next column, row, or a KV of your choosing. (see
> > > > > Filter.filterKeyValue and Filter.getNextKeyHint).
> > > > >
> > > > > (as an aside, it would probably be nice if Filters also had
> > > > > INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by
> > > > > StoreScanner)
> > > > >
> > > > > Have a look at ColumnPrefixFilter as an example.
> > > > > I also wrote a short post here:
> > > > > http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
> > > > >
> > > > > Does that help?
> > > > >
> > > > > -- Lars
> > > > >
> > > > >
> > > > > ----- Original Message -----
> > > > > From: Jerry Lam <chiling...@gmail.com>
> > > > > To: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > > Cc: "user@hbase.apache.org" <user@hbase.apache.org>
> > > > > Sent: Monday, August 27, 2012 5:59 PM
> > > > > Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
> > > > >
> > > > > Hi Lars:
> > > > >
> > > > > Thanks for confirming the inefficiency of the implementation for
> > > > > this case. For my case, a column can have more than 10K versions,
> > > > > I need a quick way to stop the scan from digging the column once
> > > > > there is a match (ReturnCode.INCLUDE). It would be nice to have a
> > > > > ReturnCode that can notify the framework to stop and go to next
> > > > > column once the number of versions specify in setMaxVersions is
> > > > > met.
> > > > >
> > > > > For now, I guess I have to hack it in the custom filter (I.e. I
> > > > > keep the count myself)? If you have a better way to achieve this,
> > > > > please share :)
> > > > >
> > > > > Best Regards,
> > > > >
> > > > > Jerry
> > > > >
> > > > > Sent from my iPad (sorry for spelling mistakes)
> > > > >
> > > > > On 2012-08-27, at 20:11, lars hofhansl <lhofha...@yahoo.com> wrote:
> > > > >
> > > > > > Currently filters are evaluated before we do version counting.
> > > > > >
> > > > > > Here's a comment from ScanQueryMatcher.java:
> > > > > >   /**
> > > > > >    * Filters should be checked before checking column trackers.
> > > > > >    * If we do otherwise, as was previously being done,
> > > > > >    * ColumnTracker may increment its counter for even that KV
> > > > > >    * which may be discarded later on by Filter. This would lead
> > > > > >    * to incorrect results in certain cases.
> > > > > >    */
> > > > > >
> > > > > > So this is by design. (Doesn't mean it's correct or desirable,
> > > > > > though.)
> > > > > >
> > > > > > -- Lars
> > > > > >
> > > > > >
> > > > > > ----- Original Message -----
> > > > > > From: Jerry Lam <chiling...@gmail.com>
> > > > > > To: user <user@hbase.apache.org>
> > > > > > Cc:
> > > > > > Sent: Monday, August 27, 2012 2:40 PM
> > > > > > Subject: setTimeRange and setMaxVersions seem to be inefficient
> > > > > >
> > > > > > Hi HBase community:
> > > > > >
> > > > > > I tried to use setTimeRange and setMaxVersions to limit the
> > > > > > number of KVs return per column. The behaviour is as I would
> > > > > > expect, that is, setTimeRange(0, T + 1) and setMaxVersions(1)
> > > > > > will give me ONE version of KV with timestamp that is less than
> > > > > > or equal to T.
> > > > > > However, I noticed that all versions of the KeyValue for a
> > > > > > particular column are processed through a custom filter I
> > > > > > implemented even though I specify setMaxVersions(1) and
> > > > > > setTimeRange(0, T+1). I expected that if ONE KV of a particular
> > > > > > column has ReturnCode.INCLUDE, the framework will jump to the
> > > > > > next COL instead of iterating through all versions of the
> > > > > > column.
> > > > > >
> > > > > > Can someone confirm me if this is the expected behaviour
> > > > > > (iterating through all versions of a column before
> > > > > > setMaxVersions take effect)? If this is an expected behaviour,
> > > > > > what is your recommendation to speed this up?
> > > > > >
> > > > > > Best Regards,
> > > > > >
> > > > > > Jerry