bq. is the 40% randomly distributed or sequential?
Looks like the distribution is striped:
    if (i % 100 <= flag_percent) {
      put.add(cf_essential, col_name, flag_yes);
    }
Within each stripe of 100 rows, the matching rows are sequential.
Let me try simulating random distribution.
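
For reference, here is a minimal sketch of the two selection schemes, kept free of HBase dependencies. stripedMatch mirrors the condition in the snippet above; randomMatch, the class name, the seed, and the row count of 10000 are assumptions made for illustration, not code from TestJoinedScanners.

```java
import java.util.Random;

final class FlagDistribution {
    // Striped selection, same condition as the snippet above:
    // indices 0..flagPercent of every 100 consecutive rows match.
    static boolean stripedMatch(int i, int flagPercent) {
        return i % 100 <= flagPercent;
    }

    // Hypothetical random alternative: each row matches independently
    // with probability flagPercent / 100.
    static boolean randomMatch(Random rng, int flagPercent) {
        return rng.nextInt(100) < flagPercent;
    }

    // Counts matching rows under the striped scheme.
    static int countStriped(int rows, int flagPercent) {
        int matches = 0;
        for (int i = 0; i < rows; i++) {
            if (stripedMatch(i, flagPercent)) {
                matches++;
            }
        }
        return matches;
    }

    // Counts matching rows under the random scheme, with a fixed seed
    // so that runs are repeatable.
    static int countRandom(int rows, int flagPercent, long seed) {
        Random rng = new Random(seed);
        int matches = 0;
        for (int i = 0; i < rows; i++) {
            if (randomMatch(rng, flagPercent)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int rows = 10000; // assumed row count, for illustration only
        System.out.println("striped matches: " + countStriped(rows, 40));
        System.out.println("random matches:  " + countRandom(rows, 40, 42L));
    }
}
```

Note that because the striped condition uses <=, it matches flag_percent + 1 out of every 100 row indices (41 when flag_percent is 40), while the random sketch matches about flag_percent out of 100 on average.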
On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
> In the TestJoinedScanners.java, is the 40% randomly distributed or
> sequential?
>
> In our test, the % is randomly distributed. Also, our custom filter does
> the same thing that SingleColumnValueFilter does. On the client-side, we'd
> execute the query in parallel, through multiple scans along the region
> boundaries. Would that have a negative impact on performance for this
> "essential column family" feature?
>
> Thanks,
>
> James
>
>
> On 04/08/2013 10:10 AM, Anoop John wrote:
>
>> Agree here. The effectiveness depends on what % of data satisfies the
>> condition and how it is distributed across HFile blocks. We will get a
>> performance gain when we are able to skip some HFile blocks (from
>> non essential CFs). Can you test with a different (lower) HFile block
>> size?
>>
>> -Anoop-
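
As a rough illustration of the block-skipping point above: if matching rows were spread uniformly at random, a block of the non essential CF could be skipped only when none of its rows match. The sketch below estimates that fraction as (1 - p)^n. The independence assumption, the class name, and the rows-per-block figures are all assumptions for illustration; real HFile blocks depend on block size, key size, and compression.

```java
final class BlockSkipEstimate {
    // Probability that a block holding rowsPerBlock rows contains no matching
    // row, assuming each row matches independently with probability
    // matchFraction (a simplifying assumption, not measured behaviour).
    static double skippableFraction(double matchFraction, int rowsPerBlock) {
        return Math.pow(1.0 - matchFraction, rowsPerBlock);
    }

    public static void main(String[] args) {
        int[] rowsPerBlock = {50, 500};                    // hypothetical block occupancies
        double[] matchFractions = {0.0, 0.01, 0.05, 0.40}; // selectivities to compare
        for (int n : rowsPerBlock) {
            for (double p : matchFractions) {
                System.out.printf("rows/block=%d match=%.2f skippable=%.4f%n",
                        n, p, skippableFraction(p, n));
            }
        }
    }
}
```

Under these assumptions, fewer rows per block (a lower HFile block size) leaves more blocks with no matching row, which is why block size is worth varying in the test.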
>>
>>
>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
>>
>>> I made the following change in TestJoinedScanners.java:
>>>
>>> - int flag_percent = 1;
>>> + int flag_percent = 40;
>>>
>>> The test took longer but still favors joined scanner.
>>> I got some new results:
>>>
>>> 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>
>>> 2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157):
>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>> ...
>>> 2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157):
>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>
>>> Looks like effectiveness of joined scanner is affected by distribution of
>>> data.
>>>
>>> Cheers
>>>
>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
>>>
>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>> rows match, which would somewhat be in line with James' results.
>>>>
>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>
>>>>
>>>> -- Lars
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: Ted Yu <[email protected]>
>>>> To: [email protected]
>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>> Subject: Re: Essential column family performance
>>>>
>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>> reference.
>>>>
>>>> On my MacBook, I got the following results from the test:
>>>>
>>>> 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>> ...
>>>> 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> Looking at
>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
>>>>> I found that it didn't contain TestJoinedScanners which shows
>>>>> difference in scanner performance:
>>>>>
>>>>> LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>     Double.toString(timeSec)
>>>>>     + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>
>>>>> The test uses SingleColumnValueFilter:
>>>>>
>>>>> SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>     cf_essential, col_name, CompareFilter.CompareOp.EQUAL,
>>>>>     flag_yes);
>>>>>
>>>>> It is possible that the custom filter you were using would exhibit
>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
>>>>> your filter utilize hint ?
>>>>> It would be easier for me and other people to reproduce the issue you
>>>>> experienced if you put your scenario in some test similar to
>>>>> TestJoinedScanners.
>>>>>
>>>>> Will take a closer look at the code Monday.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>>>>>
>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>
>>>>>> I can see that if the essential column family has more data compared to
>>>>>> the non essential column family that the results would eventually even out.
>>>>>> I was hoping to always be able to enable the essential column family
>>>>>> feature. Is there an inherent reason why performance would degrade like
>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>>
>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>
>>>>>>> James:
>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>
>>>>>>> What Filter were you using ?
>>>>>>>
>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>
>>>>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>>> family carrying considerably more data compared to essential column
>>>>>>> family.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> We're doing some performance testing of the essential column family
>>>>>>>> feature, and we're seeing some performance degradation when comparing
>>>>>>>> with and without the feature enabled:
>>>>>>>>
>>>>>>>>                         Performance of scan relative
>>>>>>>> % of rows selected      to not enabling the feature
>>>>>>>> ------------------      ----------------------------
>>>>>>>> 100%                    1.0x
>>>>>>>> 80%                     2.0x
>>>>>>>> 60%                     2.3x
>>>>>>>> 40%                     2.2x
>>>>>>>> 20%                     1.5x
>>>>>>>> 10%                     1.0x
>>>>>>>> 5%                      0.67x
>>>>>>>> 0%                      0.30x
>>>>>>>>
>>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>>> essential column family is used in the filter, while the key value from
>>>>>>>> the other, non essential column family is returned by the scan. Each row
>>>>>>>> contains values for both key values, with the values being relatively
>>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>>>>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>>>>>>
>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>