Re: Essential column family performance

lars hofhansl Sun, 07 Apr 2013 20:53:29 -0700

Looking at the joined scanner test code, it sets it up such that 1% of the rows 
match, which would somewhat be in line with James' results.


In my own testing a while ago I found a 100% improvement with 0% match.
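
A rough way to see why the match rate matters so much: the sketch below is a toy cost model, not HBase code. The class and method names are hypothetical, and it assumes the joined scanner reads the essential family for every row but the other family only for rows that pass the filter. It counts bytes only and ignores seek cost, so it cannot reproduce the slowdown James reports at 20-80% selectivity.

```java
public class JoinedScanModel {
    /** Bytes read when every row loads both families (no on-demand loading). */
    public static long fullScanBytes(long rows, int essentialBytes, int otherBytes) {
        return rows * (essentialBytes + otherBytes);
    }

    /** Bytes read when the non-essential family is loaded only for matching rows. */
    public static long joinedScanBytes(long rows, long matchingRows,
                                       int essentialBytes, int otherBytes) {
        return rows * essentialBytes + matchingRows * otherBytes;
    }

    public static void main(String[] args) {
        long rows = 10_000;      // assumed table size
        int essentialBytes = 50; // assumed per-row size of the essential family
        int otherBytes = 50;     // assumed per-row size of the other family
        for (int pct : new int[] {0, 1, 10, 50, 100}) {
            long matching = rows * pct / 100;
            long full = fullScanBytes(rows, essentialBytes, otherBytes);
            long joined = joinedScanBytes(rows, matching, essentialBytes, otherBytes);
            System.out.printf("%3d%% match: full=%d joined=%d ratio=%.2f%n",
                pct, full, joined, (double) joined / full);
        }
    }
}
```

In this model the byte savings shrink linearly as more rows match and vanish at 100%. Any extra per-row seek cost in the joined path, which the model leaves out, would push the break-even point lower.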


-- Lars



________________________________
 From: Ted Yu <[email protected]>
To: [email protected] 
Sent: Sunday, April 7, 2013 4:13 PM
Subject: Re: Essential column family performance
 
I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
reference.

On my MacBook, I got the following results from the test:

2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 7.973822 seconds, got 100 rows
...
2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 0.47235 seconds, got 100 rows

Cheers

On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:

> Looking at
> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt, I 
> found that it didn't contain TestJoinedScanners, which shows the
> difference in scanner performance:
>
>    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>       Double.toString(timeSec)
>       + " seconds, got " + Long.toString(rows_count/2) + " rows");
>
> The test uses SingleColumnValueFilter:
>
>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>
> It is possible that the custom filter you were using exhibits a
> different access pattern from SingleColumnValueFilter. For example, does
> your filter use hints?
> It would be easier for me and other people to reproduce the issue you
> experienced if you put your scenario in some test similar to
> TestJoinedScanners.
>
> Will take a closer look at the code Monday.
>
> Cheers
>
> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>
>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>> filterIfMissing isn't the issue - the results of the scan are correct.
>>
>> I can see that if the essential column family has more data compared to
>> the non essential column family that the results would eventually even out.
>> I was hoping to always be able to enable the essential column family
>> feature. Is there an inherent reason why performance would degrade like
>> this? Does it boil down to a single sequential scan versus many seeks?
>>
>> Thanks,
>>
>> James
>>
>>
>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>
>>> James:
>>> Your test was based on 0.94.6.1, right?
>>>
>>> What Filter were you using?
>>>
>>> If you used SingleColumnValueFilter, have you seen my comment here?
>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>
>>> BTW the use case Max Lapan tried to address has the non essential column
>>> family carrying considerably more data than the essential column family.
>>>
>>> Cheers
>>>
>>>
>>>
>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>>>
>>>> Hello,
>>>> We're doing some performance testing of the essential column family
>>>> feature, and we're seeing some performance degradation when comparing
>>>> with and without the feature enabled:
>>>>
>>>>                          Performance of scan relative
>>>> % of rows selected       to not enabling the feature
>>>> ---------------------    ------------------------------
>>>> 100%                     1.0x
>>>>  80%                     2.0x
>>>>  60%                     2.3x
>>>>  40%                     2.2x
>>>>  20%                     1.5x
>>>>  10%                     1.0x
>>>>   5%                     0.67x
>>>>   0%                     0.30x
>>>>
>>>> In our scenario, we have two column families. The key value from the
>>>> essential column family is used in the filter, while the key value from
>>>> the other, non essential column family is returned by the scan. Each row
>>>> contains both key values, with the values being relatively narrow (less
>>>> than 50 bytes). In this scenario, the only time we're seeing a
>>>> performance gain is when less than 10% of the rows are selected.
>>>>
>>>> Is this a reasonable test? Has anyone else measured this?
>>>>
>>>> Thanks,
>>>>
>>>> James
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>
