Re: Essential column family performance

Ted Yu Mon, 08 Apr 2013 07:50:20 -0700

I made the following change in TestJoinedScanners.java:

-      int flag_percent = 1;
+      int flag_percent = 40;


The test took longer, but the joined scanner is still faster.
I got some new results:

2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 7.424388 seconds, got 2050 rows
...
2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 5.05063 seconds, got 2050 rows

2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 6.348517 seconds, got 2050 rows
...
2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 4.587545 seconds, got 2050 rows

Looks like the effectiveness of the joined scanner depends on the distribution
of the data (here, the percentage of rows that match the filter).
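
To make the comparison easier to read, here is a small sketch that turns the
logged times from this thread into speedup ratios. It is not part of
TestJoinedScanners; the class name ScannerSpeedup and the method speedup are
made up for illustration, and the only inputs are the timings quoted in this
thread. With those numbers the joined scanner comes out roughly 16.9x faster
at flag_percent = 1, and roughly 1.47x and 1.38x faster in the two runs at
flag_percent = 40.

```java
public class ScannerSpeedup {
    /**
     * Ratio of slow-scanner time to joined-scanner time.
     * A value above 1 means the joined scanner was faster.
     * (Hypothetical helper, not part of HBase.)
     */
    public static double speedup(double slowSec, double joinedSec) {
        return slowSec / joinedSec;
    }

    public static void main(String[] args) {
        // {slow seconds, joined seconds}, copied from the log lines in this thread.
        double[][] runs = {
            {7.973822, 0.47235},   // flag_percent = 1 (run from 2013-04-07)
            {7.424388, 5.05063},   // flag_percent = 40, first run
            {6.348517, 4.587545},  // flag_percent = 40, second run
        };
        for (double[] r : runs) {
            System.out.printf("slow=%.3fs joined=%.3fs speedup=%.2fx%n",
                r[0], r[1], speedup(r[0], r[1]));
        }
    }
}
```

These are single runs on one machine, so the ratios are only indicative.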

Cheers

On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:

> Looking at the joined scanner test code, it sets it up such that 1% of the
> rows match, which would somewhat be in line with James' results.
>
> In my own testing a while ago I found a 100% improvement with 0% match.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Ted Yu <[email protected]>
> To: [email protected]
> Sent: Sunday, April 7, 2013 4:13 PM
> Subject: Re: Essential column family performance
>
> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
> reference.
>
> On my MacBook, I got the following results from the test:
>
> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
> Slow scanner finished in 7.973822 seconds, got 100 rows
> ...
> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
> Joined scanner finished in 0.47235 seconds, got 100 rows
>
> Cheers
>
> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>
> > Looking at
> > https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
> > I found that it didn't contain TestJoinedScanners, which shows the
> > difference in scanner performance:
> >
> >    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in "
> >       + Double.toString(timeSec)
> >       + " seconds, got " + Long.toString(rows_count/2) + " rows");
> >
> > The test uses SingleColumnValueFilter:
> >
> >     SingleColumnValueFilter filter = new SingleColumnValueFilter(
> >         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
> >
> > It is possible that the custom filter you were using exhibits a
> > different access pattern from SingleColumnValueFilter, e.g. does your
> > filter use a seek hint?
> > It would be easier for me and other people to reproduce the issue you
> > experienced if you put your scenario in some test similar to
> > TestJoinedScanners.
> >
> > Will take a closer look at the code Monday.
> >
> > Cheers
> >
> > On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
> >
> >> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
> >> so filterIfMissing isn't the issue - the results of the scan are correct.
> >>
> >> I can see that if the essential column family has more data compared to
> >> the non essential column family that the results would eventually even
> >> out. I was hoping to always be able to enable the essential column family
> >> feature. Is there an inherent reason why performance would degrade like
> >> this? Does it boil down to a single sequential scan versus many seeks?
> >>
> >> Thanks,
> >>
> >> James
> >>
> >>
> >> On 04/07/2013 07:44 AM, Ted Yu wrote:
> >>
> >>> James:
> >>> Your test was based on 0.94.6.1, right ?
> >>>
> >>> What Filter were you using ?
> >>>
> >>> If you used SingleColumnValueFilter, have you seen my comment here ?
> >>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
> >>>
> >>> BTW the use case Max Lapan tried to address has non essential column
> >>> family carrying considerably more data compared to essential column
> >>> family.
> >>>
> >>> Cheers
> >>>
> >>>
> >>>
> >>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
> >>>
> >>>> Hello,
> >>>> We're doing some performance testing of the essential column family
> >>>> feature, and we're seeing some performance degradation when comparing
> >>>> with and without the feature enabled:
> >>>>
> >>>>                           Performance of scan relative
> >>>> % of rows selected        to not enabling the feature
> >>>> ------------------        ----------------------------
> >>>> 100%                      1.0x
> >>>>  80%                      2.0x
> >>>>  60%                      2.3x
> >>>>  40%                      2.2x
> >>>>  20%                      1.5x
> >>>>  10%                      1.0x
> >>>>   5%                      0.67x
> >>>>   0%                      0.30x
> >>>>
> >>>> In our scenario, we have two column families. The key value from the
> >>>> essential column family is used in the filter, while the key value from
> >>>> the other, non essential column family is returned by the scan. Each row
> >>>> contains values for both key values, with the values being relatively
> >>>> narrow (less than 50 bytes). In this scenario, the only time we're seeing
> >>>> a performance gain is when less than 10% of the rows are selected.
> >>>>
> >>>> Is this a reasonable test? Has anyone else measured this?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> James
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >
>
