Re: Essential column family performance

Jean-Marc Spaggiari Mon, 08 Apr 2013 10:19:51 -0700

Something I'm not getting: why not use separate tables instead of
CFs in a single table? Simply name each table tablename_cfname and
you get rid of the limit on the number of CFs.

Or are there big advantages to having CFs?

JM

2013/4/8 Anoop John <[email protected]>:
> Agree here. The effectiveness depends on what % of the data satisfies the
> condition and how it is distributed across HFile blocks. We will get a
> performance gain when we are able to skip some HFile blocks (from
> non essential CFs). Can you test with a different (lower) HFile block size?
>
> -Anoop-
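
To make the block-skipping argument above concrete, here is a rough back-of-the-envelope model, not a measurement. It assumes every row matches the filter independently with probability p and that a block of the non essential CF holds n rows; under that assumption a block can be skipped only if none of its rows match, which happens with probability (1 - p)^n. The class and method names are hypothetical and nothing here calls HBase.

```java
public class BlockSkipModel {

    /**
     * Probability that a block holding rowsPerBlock rows contains no matching
     * row, assuming each row matches independently with probability
     * matchFraction. This is a simplifying assumption, not HBase behavior.
     */
    static double skipProbability(double matchFraction, int rowsPerBlock) {
        if (matchFraction < 0.0 || matchFraction > 1.0) {
            throw new IllegalArgumentException("matchFraction must be in [0, 1]");
        }
        if (rowsPerBlock < 1) {
            throw new IllegalArgumentException("rowsPerBlock must be >= 1");
        }
        return Math.pow(1.0 - matchFraction, rowsPerBlock);
    }

    public static void main(String[] args) {
        double[] matchFractions = {0.01, 0.05, 0.10, 0.40};
        int[] rowsPerBlockOptions = {10, 100, 1000}; // smaller block => fewer rows per block
        for (double p : matchFractions) {
            for (int n : rowsPerBlockOptions) {
                System.out.printf("match=%.2f rowsPerBlock=%d skippable=%.4f%n",
                        p, n, skipProbability(p, n));
            }
        }
    }
}
```

Under this model the skippable fraction drops quickly as either the match rate or the number of rows per block grows, which is one way to read the suggestion to try a lower block size. Real data is rarely uniformly distributed, so clustered matches could behave quite differently.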
>
>
> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
>
>> I made the following change in TestJoinedScanners.java:
>>
>> -      int flag_percent = 1;
>> +      int flag_percent = 40;
>>
>> The test took longer but still favors joined scanner.
>> I got some new results:
>>
>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>> ...
>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>
>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>> ...
>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>
>> Looks like effectiveness of joined scanner is affected by distribution of
>> data.
>>
>> Cheers
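
For reference, the ratios implied by the timings quoted in this thread can be worked out with a few lines of arithmetic. The numbers below are copied from the log lines above and from the 1% run quoted further down; the class name is hypothetical, and the runs may not have been made under identical conditions, so this is only a rough comparison.

```java
public class ScannerSpeedup {

    /** Ratio of slow-scanner time to joined-scanner time (> 1 means joined is faster). */
    static double speedup(double slowSeconds, double joinedSeconds) {
        if (joinedSeconds <= 0.0) {
            throw new IllegalArgumentException("joinedSeconds must be positive");
        }
        return slowSeconds / joinedSeconds;
    }

    public static void main(String[] args) {
        // Each entry: {flag_percent, slow seconds, joined seconds}, taken from the quoted logs.
        double[][] runs = {
            {1, 7.973822, 0.47235},
            {40, 7.424388, 5.05063},
            {40, 6.348517, 4.587545},
        };
        for (double[] run : runs) {
            System.out.printf("flag_percent=%d speedup=%.2fx%n",
                    (int) run[0], speedup(run[1], run[2]));
        }
    }
}
```

The 1% run works out to well over a tenfold gain, while both 40% runs land below 1.5x, which is consistent with the observation that the benefit depends on how much of the data matches.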
>>
>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
>>
>> > Looking at the joined scanner test code, it sets it up such that 1% of the
>> > rows match, which would somewhat be in line with James' results.
>> >
>> > In my own testing a while ago I found a 100% improvement with 0% match.
>> >
>> >
>> > -- Lars
>> >
>> >
>> >
>> > ________________________________
>> >  From: Ted Yu <[email protected]>
>> > To: [email protected]
>> > Sent: Sunday, April 7, 2013 4:13 PM
>> > Subject: Re: Essential column family performance
>> >
>> > I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>> > reference.
>> >
>> > On my MacBook, I got the following results from the test:
>> >
>> > 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>> > Slow scanner finished in 7.973822 seconds, got 100 rows
>> > ...
>> > 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>> > Joined scanner finished in 0.47235 seconds, got 100 rows
>> >
>> > Cheers
>> >
>> > On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>> >
>> > > Looking at
>> > > https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt
>> > > I found that it didn't contain TestJoinedScanners, which shows the
>> > > difference in scanner performance:
>> > >
>> > >    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>> > >        Double.toString(timeSec) +
>> > >        " seconds, got " + Long.toString(rows_count/2) + " rows");
>> > >
>> > > The test uses SingleColumnValueFilter:
>> > >
>> > >    SingleColumnValueFilter filter = new SingleColumnValueFilter(
>> > >        cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>> > >
>> > > It is possible that the custom filter you were using exhibits a
>> > > different access pattern from SingleColumnValueFilter, e.g. does
>> > > your filter use hints?
>> > > It would be easier for me and other people to reproduce the issue you
>> > > experienced if you put your scenario in some test similar to
>> > > TestJoinedScanners.
>> > >
>> > > Will take a closer look at the code Monday.
>> > >
>> > > Cheers
>> > >
>> > > On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>> > >
>> > >> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>> > >> filterIfMissing isn't the issue - the results of the scan are correct.
>> > >>
>> > >> I can see that if the essential column family has more data compared to
>> > >> the non essential column family that the results would eventually even out.
>> > >> I was hoping to always be able to enable the essential column family
>> > >> feature. Is there an inherent reason why performance would degrade like
>> > >> this? Does it boil down to a single sequential scan versus many seeks?
>> > >>
>> > >> Thanks,
>> > >>
>> > >> James
>> > >>
>> > >>
>> > >> On 04/07/2013 07:44 AM, Ted Yu wrote:
>> > >>
>> > >>> James:
>> > >>> Your test was based on 0.94.6.1, right ?
>> > >>>
>> > >>> What Filter were you using ?
>> > >>>
>> > >>> If you used SingleColumnValueFilter, have you seen my comment here ?
>> > >>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>> > >>>
>> > >>> BTW the use case Max Lapan tried to address has non essential column
>> > >>> family carrying considerably more data compared to essential column family.
>> > >>>
>> > >>> Cheers
>> > >>>
>> > >>>
>> > >>>
>> > >>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>> > >>>
>> > >>>> Hello,
>> > >>>> We're doing some performance testing of the essential column family
>> > >>>> feature, and we're seeing some performance degradation when comparing
>> > >>>> with and without the feature enabled:
>> > >>>>
>> > >>>>                           Performance of scan relative
>> > >>>> % of rows selected        to not enabling the feature
>> > >>>> ------------------        ----------------------------
>> > >>>> 100%                      1.0x
>> > >>>>  80%                      2.0x
>> > >>>>  60%                      2.3x
>> > >>>>  40%                      2.2x
>> > >>>>  20%                      1.5x
>> > >>>>  10%                      1.0x
>> > >>>>   5%                      0.67x
>> > >>>>   0%                      0.30x
>> > >>>>
>> > >>>> In our scenario, we have two column families. The key value from the
>> > >>>> essential column family is used in the filter, while the key value from
>> > >>>> the other, non essential column family is returned by the scan. Each row
>> > >>>> contains values for both key values, with the values being relatively
>> > >>>> narrow (less than 50 bytes). In this scenario, the only time we're
>> > >>>> seeing a performance gain is when less than 10% of the rows are selected.
>> > >>>>
>> > >>>> Is this a reasonable test? Has anyone else measured this?
>> > >>>>
>> > >>>> Thanks,
>> > >>>>
>> > >>>> James
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>
>> > >
>> >
>>
