Re: Essential column family performance

Ted Yu Mon, 08 Apr 2013 19:51:49 -0700

Using a 30% selection rate, random distribution, and FAST_DIFF encoding on
both column families, I got:


2013-04-08 19:46:21,802 INFO  [main] regionserver.TestJoinedScanners(166):
Slow scanner finished in 5.251182 seconds, got 1547 rows
...
2013-04-08 19:46:26,661 INFO  [main] regionserver.TestJoinedScanners(166):
Joined scanner finished in 4.858834 seconds, got 1547 rows

2013-04-08 19:46:31,891 INFO  [main] regionserver.TestJoinedScanners(166):
Slow scanner finished in 5.22988 seconds, got 1547 rows
...
2013-04-08 19:46:36,566 INFO  [main] regionserver.TestJoinedScanners(166):
Joined scanner finished in 4.674822 seconds, got 1547 rows
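
A minimal sketch, assuming only the four timings logged above, of how the
relative speedup of the joined scanner can be computed. The class and method
names are illustrative and not part of TestJoinedScanners. Both runs come out
at roughly 1.1x.

```java
// Sketch: ratio of slow-scanner time to joined-scanner time for the two runs
// logged above. ScannerSpeedup and speedup() are hypothetical names.
class ScannerSpeedup {
    /** Returns how many times faster the joined scanner was than the slow one. */
    static double speedup(double slowSeconds, double joinedSeconds) {
        return slowSeconds / joinedSeconds;
    }

    public static void main(String[] args) {
        double[][] runs = {
            {5.251182, 4.858834},  // first run: slow, joined
            {5.22988, 4.674822},   // second run: slow, joined
        };
        for (double[] run : runs) {
            System.out.printf("slow=%.3fs joined=%.3fs speedup=%.2fx%n",
                run[0], run[1], speedup(run[0], run[1]));
        }
    }
}
```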

Cheers

On Mon, Apr 8, 2013 at 6:53 PM, James Taylor <[email protected]> wrote:

> Good idea, Sergey. We'll rerun with larger non essential column family
> values and see if there's a crossover point. One other difference for us is
> that we're using FAST_DIFF encoding. We'll try with no encoding too. Our
> table has 20 million rows across four region servers.
>
> Regarding the parallelization we do, we run multiple scans in parallel
> instead of one single scan over the table. We use the region boundaries of
> the table to divide up the work evenly, adding a start/stop key for each
> scan that corresponds to the region boundaries. Our client then does a
> final merge/aggregation step (i.e. adding up the count it gets back from
> the scan for each region).
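
A minimal sketch of the per-region parallel count described above, using only
the JDK. A sorted in-memory map stands in for the table, and
ParallelRegionCount, parallelCount and the boundary keys are hypothetical
names, not HBase API. Each task covers one [start, stop) key range and the
client adds up the partial counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: the map simulates a table, the boundaries simulate region edges.
class ParallelRegionCount {
    /**
     * Counts rows whose flag is true, one task per key range, then merges the
     * partial counts on the client. boundaries must be sorted and contain at
     * least two keys; range i is [boundaries[i], boundaries[i + 1]).
     */
    static long parallelCount(NavigableMap<String, Boolean> table,
                              List<String> boundaries) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, boundaries.size() - 1));
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String stop = boundaries.get(i + 1);
                // One "scan" per region: start key inclusive, stop key exclusive.
                Callable<Long> scan = () ->
                    table.subMap(start, true, stop, false).values().stream()
                        .filter(Boolean::booleanValue)
                        .count();
                partials.add(pool.submit(scan));
            }
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();  // final merge/aggregation step
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("parallel count failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Boolean> table = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            table.put(String.format("row%03d", i), i % 5 == 0);  // 20 flagged rows
        }
        List<String> boundaries =
            Arrays.asList("row000", "row025", "row050", "row075", "row100");
        System.out.println("matching rows: " + parallelCount(table, boundaries));
    }
}
```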
>
>
> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
>
>> IntegrationTestLazyCfLoading uses randomly distributed keys with the
>> following condition for filtering:
>> 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
>> is the hex string of the MD5 key.
>> Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
>> This test also showed significant improvement IIRC, so random distribution
>> and a high percentage of values selected should not be a problem as such.
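
A minimal sketch of the condition quoted above, using only the JDK. It assumes
the row key is the lowercase hex string of an MD5 digest and uses
String.substring in place of Bytes.toString(rowKey, 0, 4). Md5KeyFilter and
the "row-" key prefix are hypothetical. Because the test looks at the low bit
of the first four hex characters, roughly half of the randomly distributed
keys should be selected.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch only: reproduces the selection condition outside of HBase.
class Md5KeyFilter {
    /** Lowercase hex string of the MD5 digest of the input. */
    static String md5Hex(String input) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Parses the first 4 hex characters and tests the lowest bit. */
    static boolean selected(String rowKeyHex) {
        return 1 == (Long.parseLong(rowKeyHex.substring(0, 4), 16) & 1);
    }

    /** Fraction of n generated keys that satisfy the condition. */
    static double fractionSelected(int n) {
        int hits = 0;
        for (int i = 0; i < n; i++) {
            if (selected(md5Hex("row-" + i))) {
                hits++;
            }
        }
        return (double) hits / n;
    }

    public static void main(String[] args) {
        System.out.println("fraction selected: " + fractionSelected(10_000));
    }
}
```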
>>
>> My hunch would be that the additional cost of seeks/merging the results
>> from two CFs outweighs the benefit of lazy loading on such small values
>> for the "lazy" CF with lots of data selected. This feature definitely makes
>> no sense if you are selecting all values, because then extra work is being
>> done for no benefit (everything is read anyway).
>> So the use cases would be larger "lazy" CFs and/or a low percentage of
>> values selected.
>>
>> Can you try to increase the 2nd CF values' size and rerun the test?
>>
>>
>> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
>>
>>> In the TestJoinedScanners.java, is the 40% randomly distributed or
>>> sequential?
>>>
>>> In our test, the % is randomly distributed. Also, our custom filter does
>>> the same thing that SingleColumnValueFilter does. On the client side, we'd
>>> execute the query in parallel, through multiple scans along the region
>>> boundaries. Would that have a negative impact on performance for this
>>> "essential column family" feature?
>>>
>>> Thanks,
>>>
>>>      James
>>>
>>>
>>> On 04/08/2013 10:10 AM, Anoop John wrote:
>>>
>>>> Agree here. The effectiveness depends on what % of data satisfies the
>>>> condition and how it is distributed across HFile blocks. We will get a
>>>> performance gain when we are able to skip some HFile blocks (from
>>>> non essential CFs). Can you test with a different HFile block size (lower
>>>> value)?
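
A rough sketch of why block size matters here, under a deliberately simplified
model that is an assumption and not taken from the HBase code: rows match
independently with a fixed probability, and a block of the non essential CF
can be skipped only if none of its rows matches. BlockSkipEstimate and the
rows-per-block figures are hypothetical. Under this model the skippable
fraction is (1 - matchRate)^rowsPerBlock, which falls off quickly as blocks
hold more rows or as the match rate grows.

```java
// Sketch only: a toy probability model, not a measurement of HBase behavior.
class BlockSkipEstimate {
    /**
     * Probability that a block contains no matching row, assuming each of the
     * rowsPerBlock rows matches independently with probability matchRate.
     */
    static double skippableFraction(double matchRate, int rowsPerBlock) {
        return Math.pow(1.0 - matchRate, rowsPerBlock);
    }

    public static void main(String[] args) {
        double[] matchRates = {0.01, 0.10, 0.40};
        int[] rowsPerBlock = {10, 100, 1000};  // hypothetical block capacities
        for (double rate : matchRates) {
            for (int rows : rowsPerBlock) {
                System.out.printf("matchRate=%.2f rowsPerBlock=%d skippable=%.6f%n",
                    rate, rows, skippableFraction(rate, rows));
            }
        }
    }
}
```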
>>>>
>>>> -Anoop-
>>>>
>>>>
>>>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> I made the following change in TestJoinedScanners.java:
>>>>>
>>>>> -      int flag_percent = 1;
>>>>> +      int flag_percent = 40;
>>>>>
>>>>> The test took longer but still favors joined scanner.
>>>>> I got some new results:
>>>>>
>>>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>>>> ...
>>>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>>>
>>>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>>>> ...
>>>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>>>
>>>>> Looks like the effectiveness of the joined scanner is affected by the
>>>>> distribution of data.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
>>>>>
>>>>>> Looking at the joined scanner test code, it sets it up such that 1% of
>>>>>> the rows match, which would somewhat be in line with James' results.
>>>>>>
>>>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>>>
>>>>>>
>>>>>> -- Lars
>>>>>>
>>>>>>
>>>>>>
>>>>>> ________________________________
>>>>>> From: Ted Yu <[email protected]>
>>>>>> To: [email protected]
>>>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>>>> Subject: Re: Essential column family performance
>>>>>>
>>>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>>>> reference.
>>>>>>
>>>>>> On my MacBook, I got the following results from the test:
>>>>>>
>>>>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>>>> ...
>>>>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>>>>>>
>>>>>>> Looking at
>>>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
>>>>>>> I found that it didn't contain TestJoinedScanners which shows
>>>>>>> difference in scanner performance:
>>>>>>>
>>>>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>>>         Double.toString(timeSec)
>>>>>>>         + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>>>
>>>>>>> The test uses SingleColumnValueFilter:
>>>>>>>
>>>>>>>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>>>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL,
>>>>>>>         flag_yes);
>>>>>>>
>>>>>>> It is possible that the custom filter you were using would exhibit a
>>>>>>> different access pattern compared to SingleColumnValueFilter, e.g. does
>>>>>>> your filter utilize hint?
>>>>>>> It would be easier for me and other people to reproduce the issue you
>>>>>>> experienced if you put your scenario in some test similar to
>>>>>>> TestJoinedScanners.
>>>>>>>
>>>>>>> Will take a closer look at the code Monday.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
>>>>>>>> so filterIfMissing isn't the issue - the results of the scan are
>>>>>>>> correct.
>>>>>>>>
>>>>>>>> I can see that if the essential column family has more data compared to
>>>>>>>> the non essential column family that the results would eventually even
>>>>>>>> out. I was hoping to always be able to enable the essential column
>>>>>>>> family feature. Is there an inherent reason why performance would
>>>>>>>> degrade like this? Does it boil down to a single sequential scan versus
>>>>>>>> many seeks?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>>
>>>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>>>
>>>>>>>>> James:
>>>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>>>
>>>>>>>>> What Filter were you using ?
>>>>>>>>>
>>>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>>>
>>>>>>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>>>>> family carrying considerably more data compared to essential column
>>>>>>>>> family.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> We're doing some performance testing of the essential column family
>>>>>>>>>> feature, and we're seeing some performance degradation when comparing
>>>>>>>>>> with and without the feature enabled:
>>>>>>>>>>
>>>>>>>>>>                           Performance of scan relative
>>>>>>>>>> % of rows selected        to not enabling the feature
>>>>>>>>>> ------------------        ----------------------------
>>>>>>>>>>       100%                         1.0x
>>>>>>>>>>        80%                         2.0x
>>>>>>>>>>        60%                         2.3x
>>>>>>>>>>        40%                         2.2x
>>>>>>>>>>        20%                         1.5x
>>>>>>>>>>        10%                         1.0x
>>>>>>>>>>         5%                         0.67x
>>>>>>>>>>         0%                         0.30x
>>>>>>>>>>
>>>>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>>>>> essential column family is used in the filter, while the key value
>>>>>>>>>> from the other, non essential column family is returned by the scan.
>>>>>>>>>> Each row contains values for both key values, with the values being
>>>>>>>>>> relatively narrow (less than 50 bytes). In this scenario, the only
>>>>>>>>>> time we're seeing a performance gain is when less than 10% of the
>>>>>>>>>> rows are selected.
>>>>>>>>>>
>>>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> James
>>>>>>>>>>
>>>>>>>>>>
>
