Hi Lars:

Thanks for the reply.
I'd like to check whether I have misunderstood the perceived inefficiency,
because it seems you see it differently.

Let's say, as an example, we have 1 row with 2 columns (col-1 and col-2) in a
table, and each column has 1000 versions. We scan it with the following code
(the code may contain errors and may not compile):
// A very simple use case of a ColumnPrefixFilter. In fact, any other filter
// that relies on filterKeyValue will show the same performance problem I am
// concerned about when the number of versions per column is huge.

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.util.Bytes;

Filter filter = new ColumnPrefixFilter(Bytes.toBytes("col-2"));
Scan scan = new Scan();
scan.setFilter(filter);
ResultScanner scanner = table.getScanner(scan); // table is an existing HTable
try {
    for (Result result : scanner) {
        for (KeyValue kv : result.raw()) {
            System.out.println("KV: " + kv + ", Value: "
                + Bytes.toString(kv.getValue()));
        }
    }
} finally {
    scanner.close();
}

Implicitly, the number of versions per column that will be returned is 1
(the latest version). A user might expect that only 2 column-prefix
comparisons are needed (1 for col-1 and 1 for col-2), but in fact the
filterKeyValue method in ColumnPrefixFilter is called about 1000 times for
col-2 alone (once per version), plus once for col-1, because all versions of
col-2 share the same prefix for obvious reasons. For col-1, the filter
returns SEEK_NEXT_USING_HINT, which should skip the remaining 999 versions
of col-1.
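
To make the cost concrete, the per-KV check inside ColumnPrefixFilter boils
down to something like the sketch below. This is a simplified illustration I
wrote, not the actual HBase source, but it shows why every single version of
col-2 pays for a prefix comparison ("prefix" is the byte[] passed to the
filter's constructor):

// Simplified sketch (not the real ColumnPrefixFilter code) of the per-KV
// prefix check; "prefix" is the byte[] given to the filter's constructor.
public ReturnCode filterKeyValue(KeyValue kv) {
    if (kv.getQualifierLength() < prefix.length) {
        // A qualifier shorter than the prefix can never match; seek ahead.
        return ReturnCode.SEEK_NEXT_USING_HINT;
    }
    // This method is called once per version of a column, not once per
    // column, so a 5-byte prefix like "col-2" costs up to 5 byte
    // comparisons for every version.
    int cmp = Bytes.compareTo(kv.getBuffer(), kv.getQualifierOffset(),
        prefix.length, prefix, 0, prefix.length);
    if (cmp == 0) {
        return ReturnCode.INCLUDE;              // every version of col-2 lands here
    } else if (cmp < 0) {
        return ReturnCode.SEEK_NEXT_USING_HINT; // jump ahead to the prefix (col-1 case)
    } else {
        return ReturnCode.NEXT_ROW;             // already past the prefix in this row
    }
}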

In summary, the 1000 prefix comparisons for "col-2" (about 5000 byte
comparisons, since the prefix is 5 bytes long) are wasted, because only 1
version is returned to the user. This inefficiency is hidden from user code,
but it affects every filter that uses filterKeyValue as its main mechanism
for filtering KVs whenever a column has many versions. Do we have a case for
improving HBase to handle this inefficiency? :) It seems valid unless you
prove otherwise.
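
In the meantime, the workaround I have in mind is a custom filter that keeps
the per-column state itself and tells the framework to move on after the
first matching version. A rough, untested sketch (the class name is made up,
and the Writable serialization a deployed 0.94 filter would need is omitted):

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.filter.FilterBase;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleVersionColumnPrefixFilter extends FilterBase {
    private final byte[] prefix;
    private byte[] lastIncludedQualifier = null;

    public SingleVersionColumnPrefixFilter(byte[] prefix) {
        this.prefix = prefix;
    }

    @Override
    public ReturnCode filterKeyValue(KeyValue kv) {
        byte[] qualifier = kv.getQualifier();
        // A version of this qualifier was already included: skip the rest.
        if (lastIncludedQualifier != null
                && Bytes.equals(lastIncludedQualifier, qualifier)) {
            return ReturnCode.NEXT_COL;
        }
        if (qualifier.length >= prefix.length
                && Bytes.compareTo(qualifier, 0, prefix.length,
                                   prefix, 0, prefix.length) == 0) {
            lastIncludedQualifier = qualifier; // remember the matched column
            return ReturnCode.INCLUDE;         // newest version of a matching column
        }
        return ReturnCode.NEXT_COL;            // no match: move to the next column
    }

    @Override
    public void reset() {
        lastIncludedQualifier = null;          // invoked by the framework per row
    }
}

With something like this, col-2 costs two filterKeyValue calls (one INCLUDE,
one NEXT_COL) instead of 1000, at the price of giving up the
SEEK_NEXT_USING_HINT jump for non-matching columns.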

Best Regards,

Jerry



On Tue, Aug 28, 2012 at 12:54 AM, lars hofhansl <lhofha...@yahoo.com> wrote:

> First off, regarding "inefficiency"... If version counting happened first
> and the filters were executed afterwards, we'd have folks "complaining"
> about inefficiencies as well:
> ("Why does the code have to go through the versioning stuff when my filter
> filters the row/column/version anyway?")  ;-)
>
>
> For your problem, you want to make use of "seek hints"...
>
> In addition to INCLUDE you can return NEXT_COL, NEXT_ROW, or even
> SEEK_NEXT_USING_HINT from Filter.filterKeyValue(...).
>
> That way the scanning framework will know to skip ahead to the next
> column, row, or a KV of your choosing. (see Filter.filterKeyValue and
> Filter.getNextKeyHint).
>
> (as an aside, it would probably be nice if Filters also had
> INCLUDE_AND_NEXT_COL, INCLUDE_AND_NEXT_ROW, internally used by StoreScanner)
>
> Have a look at ColumnPrefixFilter as an example.
> I also wrote a short post here:
> http://hadoop-hbase.blogspot.com/2012/01/filters-in-hbase-or-intra-row-scanning.html
>
> Does that help?
>
> -- Lars
>
>
> ----- Original Message -----
> From: Jerry Lam <chiling...@gmail.com>
> To: "user@hbase.apache.org" <user@hbase.apache.org>
> Cc: "user@hbase.apache.org" <user@hbase.apache.org>
> Sent: Monday, August 27, 2012 5:59 PM
> Subject: Re: setTimeRange and setMaxVersions seem to be inefficient
>
> Hi Lars:
>
> Thanks for confirming the inefficiency of the implementation for this
> case. In my case, a column can have more than 10K versions, so I need a
> quick way to stop the scan from digging into the column once there is a
> match (ReturnCode.INCLUDE). It would be nice to have a ReturnCode that
> notifies the framework to stop and go to the next column once the number of
> versions specified in setMaxVersions is met.
>
> For now, I guess I have to hack it into a custom filter (i.e., I keep the
> count myself)? If you have a better way to achieve this, please share :)
>
> Best Regards,
>
> Jerry
>
> Sent from my iPad (sorry for spelling mistakes)
>
> On 2012-08-27, at 20:11, lars hofhansl <lhofha...@yahoo.com> wrote:
>
> > Currently filters are evaluated before we do version counting.
> >
> > Here's a comment from ScanQueryMatcher.java:
> >     /**
> >      * Filters should be checked before checking column trackers. If we do
> >      * otherwise, as was previously being done, ColumnTracker may increment its
> >      * counter for even that KV which may be discarded later on by Filter. This
> >      * would lead to incorrect results in certain cases.
> >      */
> >
> >
> > So this is by design. (Doesn't mean it's correct or desirable, though.)
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Jerry Lam <chiling...@gmail.com>
> > To: user <user@hbase.apache.org>
> > Cc:
> > Sent: Monday, August 27, 2012 2:40 PM
> > Subject: setTimeRange and setMaxVersions seem to be inefficient
> >
> > Hi HBase community:
> >
> > I tried to use setTimeRange and setMaxVersions to limit the number of KVs
> > returned per column. The behaviour is as I would expect: setTimeRange(0, T + 1)
> > and setMaxVersions(1) give me ONE version of the KV with a timestamp that
> > is less than or equal to T.
> > However, I noticed that all versions of the KeyValue for a particular
> > column are processed through a custom filter I implemented, even though I
> > specify setMaxVersions(1) and setTimeRange(0, T+1). I expected that once ONE
> > KV of a particular column has ReturnCode.INCLUDE, the framework would jump
> > to the next COL instead of iterating through all versions of the column.
> >
> > Can someone confirm whether this is the expected behaviour (iterating
> > through all versions of a column before setMaxVersions takes effect)? If
> > this is the expected behaviour, what is your recommendation to speed this up?
> >
> > Best Regards,
> >
> > Jerry
> >
>
>
