Re: Essential column family performance

ramkrishna vasudevan Mon, 08 Apr 2013 10:52:15 -0700

bq. through multiple scans along the region boundaries
Sorry, I am not able to follow what you are saying. Could you elaborate on this?
I think the validity of this essential CF feature is best tested in real
use cases such as the one in Phoenix.


Regards
Ram


On Mon, Apr 8, 2013 at 11:12 PM, Ted Yu <[email protected]> wrote:

> bq. is the 40% randomly distributed or sequential?
> Looks like the distribution is striped:
>
>         if (i % 100 <= flag_percent) {
>           put.add(cf_essential, col_name, flag_yes);
>
> In each stripe, it is sequential.
>
> Let me try simulating random distribution.
>
> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
>
> > In the TestJoinedScanners.java, is the 40% randomly distributed or
> > sequential?
> >
> > In our test, the % is randomly distributed. Also, our custom filter does
> > the same thing that SingleColumnValueFilter does. On the client-side,
> > we'd execute the query in parallel, through multiple scans along the region
> > boundaries. Would that have a negative impact on performance for this
> > "essential column family" feature?
> >
> > Thanks,
> >
> >     James
> >
> >
> > On 04/08/2013 10:10 AM, Anoop John wrote:
> >
> >> Agree here. The effectiveness depends on what % of data satisfies the
> >> condition and how it is distributed across HFile blocks. We will get a
> >> performance gain when we are able to skip some HFile blocks (from
> >> non essential CFs). Can you test with a different HFile block size
> >> (a lower value)?
> >>
> >> -Anoop-
> >>
> >>
> >> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
> >>
> >>> I made the following change in TestJoinedScanners.java:
> >>>
> >>> -      int flag_percent = 1;
> >>> +      int flag_percent = 40;
> >>>
> >>> The test took longer but still favors joined scanner.
> >>> I got some new results:
> >>>
> >>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> Slow scanner finished in 7.424388 seconds, got 2050 rows
> >>> ...
> >>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> Joined scanner finished in 5.05063 seconds, got 2050 rows
> >>>
> >>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> Slow scanner finished in 6.348517 seconds, got 2050 rows
> >>> ...
> >>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> Joined scanner finished in 4.587545 seconds, got 2050 rows
> >>>
> >>> Looks like effectiveness of joined scanner is affected by distribution
> >>> of data.
> >>>
> >>> Cheers
> >>>
> >>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
> >>>
> >>>> Looking at the joined scanner test code, it sets it up such that 1% of
> >>>> the rows match, which would somewhat be in line with James' results.
> >>>>
> >>>> In my own testing a while ago I found a 100% improvement with 0% match.
> >>>>
> >>>>
> >>>> -- Lars
> >>>>
> >>>>
> >>>>
> >>>> ________________________________
> >>>>   From: Ted Yu <[email protected]>
> >>>> To: [email protected]
> >>>> Sent: Sunday, April 7, 2013 4:13 PM
> >>>> Subject: Re: Essential column family performance
> >>>>
> >>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for
> >>>> your reference.
> >>>>
> >>>> On my MacBook, I got the following results from the test:
> >>>>
> >>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
> >>>> Slow scanner finished in 7.973822 seconds, got 100 rows
> >>>> ...
> >>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
> >>>> Joined scanner finished in 0.47235 seconds, got 100 rows
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
> >>>>
> >>>>> Looking at
> >>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
> >>>>> I found that it didn't contain TestJoinedScanners which shows
> >>>>> difference in scanner performance:
> >>>>>
> >>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> >>>>>         Double.toString(timeSec)
> >>>>>         + " seconds, got " + Long.toString(rows_count/2) + " rows");
> >>>>>
> >>>>> The test uses SingleColumnValueFilter:
> >>>>>
> >>>>>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
> >>>>>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
> >>>>>
> >>>>> It is possible that the custom filter you were using would exhibit a
> >>>>> different access pattern compared to SingleColumnValueFilter. For
> >>>>> example, does your filter utilize hints?
> >>>>> It would be easier for me and other people to reproduce the issue you
> >>>>> experienced if you put your scenario in some test similar to
> >>>>> TestJoinedScanners.
> >>>>>
> >>>>> Will take a closer look at the code Monday.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
> >>>>>
> >>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
> >>>>>> so filterIfMissing isn't the issue - the results of the scan are correct.
> >>>>>>
> >>>>>> I can see that if the essential column family has more data compared
> >>>>>> to the non essential column family that the results would eventually
> >>>>>> even out. I was hoping to always be able to enable the essential column
> >>>>>> family feature. Is there an inherent reason why performance would
> >>>>>> degrade like this? Does it boil down to a single sequential scan versus
> >>>>>> many seeks?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> James
> >>>>>>
> >>>>>>
> >>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
> >>>>>>
> >>>>>>> James:
> >>>>>>> Your test was based on 0.94.6.1, right ?
> >>>>>>>
> >>>>>>> What Filter were you using ?
> >>>>>>>
> >>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
> >>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
> >>>>>>>
> >>>>>>> BTW the use case Max Lapan tried to address has non essential column
> >>>>>>> family carrying considerably more data compared to essential column
> >>>>>>> family.
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> We're doing some performance testing of the essential column family
> >>>>>>>> feature, and we're seeing some performance degradation when comparing
> >>>>>>>> with and without the feature enabled:
> >>>>>>>>
> >>>>>>>>                             Performance of scan relative
> >>>>>>>> % of rows selected        to not enabling the feature
> >>>>>>>> ---------------------    --------------------------------
> >>>>>>>>
> >>>>>>>>   100%                            1.0x
> >>>>>>>>    80%                            2.0x
> >>>>>>>>    60%                            2.3x
> >>>>>>>>    40%                            2.2x
> >>>>>>>>    20%                            1.5x
> >>>>>>>>    10%                            1.0x
> >>>>>>>>     5%                            0.67x
> >>>>>>>>     0%                            0.30x
> >>>>>>>>
> >>>>>>>> In our scenario, we have two column families. The key value from the
> >>>>>>>> essential column family is used in the filter, while the key value
> >>>>>>>> from the other, non essential column family is returned by the scan.
> >>>>>>>> Each row contains values for both key values, with the values being
> >>>>>>>> relatively narrow (less than 50 bytes). In this scenario, the only
> >>>>>>>> time we're seeing a performance gain is when less than 10% of the
> >>>>>>>> rows are selected.
> >>>>>>>>
> >>>>>>>> Is this a reasonable test? Has anyone else measured this?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> James
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >
>
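
Note on the distribution question discussed in the thread: the striped
versus random placement of matching rows, and Anoop's point about skipping
HFile blocks, can be illustrated with a small standalone simulation. This is
only a rough sketch and does not use HBase. The class name
FlagDistributionSketch, the rowsPerBlock parameter and the random seed are
made up for illustration. A fixed number of consecutive rows stands in for
one HFile block, which is a simplification (real blocks are sized in bytes),
and actual scanner seek costs are not modeled. The striped rule is the
condition quoted by Ted (i % 100 <= flag_percent); because the comparison is
inclusive, it flags flag_percent + 1 of every 100 rows.

```java
import java.util.Random;

public class FlagDistributionSketch {

    // Striped layout: the condition quoted in the thread.
    public static boolean[] striped(int rows, int flagPercent) {
        boolean[] flags = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            flags[i] = (i % 100) <= flagPercent;
        }
        return flags;
    }

    // Random layout (assumption): each row is flagged independently
    // with probability flagPercent / 100.
    public static boolean[] random(int rows, int flagPercent, long seed) {
        Random rnd = new Random(seed);
        boolean[] flags = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            flags[i] = rnd.nextInt(100) < flagPercent;
        }
        return flags;
    }

    // Number of flagged (matching) rows.
    public static int countFlagged(boolean[] flags) {
        int n = 0;
        for (boolean f : flags) {
            if (f) {
                n++;
            }
        }
        return n;
    }

    // Counts simulated blocks (runs of rowsPerBlock consecutive rows)
    // that contain no matching row at all. Only such blocks could be
    // skipped in the non essential column family.
    public static int blocksWithoutMatch(boolean[] flags, int rowsPerBlock) {
        int skippable = 0;
        for (int start = 0; start < flags.length; start += rowsPerBlock) {
            int end = Math.min(start + rowsPerBlock, flags.length);
            boolean anyMatch = false;
            for (int i = start; i < end && !anyMatch; i++) {
                anyMatch = flags[i];
            }
            if (!anyMatch) {
                skippable++;
            }
        }
        return skippable;
    }

    public static void main(String[] args) {
        int rows = 1000;
        int flagPercent = 40;
        int rowsPerBlock = 10; // hypothetical stand-in for an HFile block
        boolean[] s = striped(rows, flagPercent);
        boolean[] r = random(rows, flagPercent, 42L);
        System.out.println("striped: flagged=" + countFlagged(s)
            + " blocks without match=" + blocksWithoutMatch(s, rowsPerBlock));
        System.out.println("random:  flagged=" + countFlagged(r)
            + " blocks without match=" + blocksWithoutMatch(r, rowsPerBlock));
    }
}
```

With rows = 1000, flagPercent = 40 and rowsPerBlock = 10, the striped layout
flags 410 rows and leaves 50 of the 100 simulated blocks without any match.
The random layout depends on the seed; comparing its count of match-free
blocks against the striped one for the same selectivity shows how much the
layout alone changes the number of blocks that could be skipped.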
