I switched to a random distribution in which 30% of the rows are selected.
I still saw a meaningful improvement from joined scanners:

2013-04-08 10:54:13,819 INFO  [main] regionserver.TestJoinedScanners(158):
Slow scanner finished in 6.20723 seconds, got 1552 rows
...
2013-04-08 10:54:18,801 INFO  [main] regionserver.TestJoinedScanners(158):
Joined scanner finished in 4.982732 seconds, got 1552 rows

2013-04-08 10:54:23,997 INFO  [main] regionserver.TestJoinedScanners(158):
Slow scanner finished in 5.195658 seconds, got 1552 rows
...
2013-04-08 10:54:28,619 INFO  [main] regionserver.TestJoinedScanners(158):
Joined scanner finished in 4.621337 seconds, got 1552 rows
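
For anyone who wants to reason about why the layout of matching rows matters, here is a minimal sketch. It is not the code in TestJoinedScanners. It only mirrors the striped check (i % 100 <= flag_percent) and compares it with an independent random pick, then counts how many fixed-size groups of rows contain at least one match. The class name, the method names, the seed, and the notion of "rows per block" are assumptions made for illustration; real HFile blocks are sized in bytes, not rows.

```java
import java.util.BitSet;
import java.util.Random;

public class SelectionLayoutSketch {

    // Striped layout, mirroring the "i % 100 <= flagPercent" check:
    // within each stripe of 100 rows the selected rows are contiguous.
    // Note that "<=" selects flagPercent + 1 rows per stripe.
    public static BitSet striped(int rows, int flagPercent) {
        BitSet selected = new BitSet(rows);
        for (int i = 0; i < rows; i++) {
            if (i % 100 <= flagPercent) {
                selected.set(i);
            }
        }
        return selected;
    }

    // Random layout (assumption): each row is selected independently
    // with probability percent / 100.
    public static BitSet random(int rows, int percent, long seed) {
        Random rng = new Random(seed);
        BitSet selected = new BitSet(rows);
        for (int i = 0; i < rows; i++) {
            if (rng.nextInt(100) < percent) {
                selected.set(i);
            }
        }
        return selected;
    }

    // Counts the groups of rowsPerBlock consecutive rows that hold at
    // least one selected row, i.e. groups that could not be skipped.
    public static int blocksTouched(BitSet selected, int rows, int rowsPerBlock) {
        int touched = 0;
        for (int start = 0; start < rows; start += rowsPerBlock) {
            int end = Math.min(start + rowsPerBlock, rows);
            int next = selected.nextSetBit(start);
            if (next >= 0 && next < end) {
                touched++;
            }
        }
        return touched;
    }

    public static void main(String[] args) {
        int rows = 100_000;
        int rowsPerBlock = 10; // hypothetical value, for illustration only
        int total = (rows + rowsPerBlock - 1) / rowsPerBlock;
        System.out.println("striped: "
            + blocksTouched(striped(rows, 30), rows, rowsPerBlock) + " of " + total);
        System.out.println("random:  "
            + blocksTouched(random(rows, 30, 42L), rows, rowsPerBlock) + " of " + total);
    }
}
```

With these assumed numbers the striped layout touches 4 of every 10 groups, while the random layout leaves a group untouched only when all ten of its rows miss, which under independence is roughly 0.7^10, or about 3% of groups. That would be consistent with random distribution leaving fewer blocks to skip, though the sketch says nothing about actual scan times.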

Cheers

On Mon, Apr 8, 2013 at 10:42 AM, Ted Yu <[email protected]> wrote:

> bq. is the 40% randomly distributed or sequential?
> Looks like the distribution is striped:
>
>         if (i % 100 <= flag_percent) {
>
>           put.add(cf_essential, col_name, flag_yes);
> In each stripe, it is sequential.
>
> Let me try simulating random distribution.
>
> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
>
>> In the TestJoinedScanners.java, is the 40% randomly distributed or
>> sequential?
>>
>> In our test, the % is randomly distributed. Also, our custom filter does
>> the same thing that SingleColumnValueFilter does.  On the client-side, we'd
>> execute the query in parallel, through multiple scans along the region
>> boundaries. Would that have a negative impact on performance for this
>> "essential column family" feature?
>>
>> Thanks,
>>
>>     James
>>
>>
>> On 04/08/2013 10:10 AM, Anoop John wrote:
>>
>>> Agree here. The effectiveness depends on what % of data satisfies the
>>> condition and how it is distributed across HFile blocks. We will get a
>>> performance gain when we are able to skip some HFile blocks (from
>>> non essential CFs). Can you test with a different (lower) HFile block
>>> size?
>>>
>>> -Anoop-
>>>
>>>
>>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> I made the following change in TestJoinedScanners.java:
>>>>
>>>> -      int flag_percent = 1;
>>>> +      int flag_percent = 40;
>>>>
>>>> The test took longer but still favors joined scanner.
>>>> I got some new results:
>>>>
>>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>>
>>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>>
>>>> Looks like the effectiveness of the joined scanner is affected by the
>>>> distribution of data.
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
>>>>
>>>>> Looking at the joined scanner test code, it sets it up such that 1% of
>>>>> the rows match, which would be somewhat in line with James' results.
>>>>>
>>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>>
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Ted Yu <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>>> Subject: Re: Essential column family performance
>>>>>
>>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>>> reference.
>>>>>
>>>>> On my MacBook, I got the following results from the test:
>>>>>
>>>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>>>>> ...
>>>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
>>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>>>>>
>>>>>> Looking at
>>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
>>>>>> I found that it didn't contain TestJoinedScanners, which shows the
>>>>>> difference in scanner performance:
>>>>>>
>>>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>>         Double.toString(timeSec)
>>>>>>         + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>>
>>>>>> The test uses SingleColumnValueFilter:
>>>>>>
>>>>>>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>>>>>>
>>>>>> It is possible that the custom filter you were using would exhibit a
>>>>>> different access pattern compared to SingleColumnValueFilter, e.g. does
>>>>>> your filter utilize hints?
>>>>>> It would be easier for me and other people to reproduce the issue you
>>>>>> experienced if you put your scenario in some test similar to
>>>>>> TestJoinedScanners.
>>>>>>
>>>>>> Will take a closer look at the code Monday.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
>>>>>>> so filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>>
>>>>>>> I can see that if the essential column family has more data compared to
>>>>>>> the non essential column family that the results would eventually even
>>>>>>> out. I was hoping to always be able to enable the essential column family
>>>>>>> feature. Is there an inherent reason why performance would degrade like
>>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>>
>>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>>
>>>>>>>> James:
>>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>>
>>>>>>>> What Filter were you using ?
>>>>>>>>
>>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here?
>>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>>
>>>>>>>> BTW the use case Max Lapan tried to address has the non essential column
>>>>>>>> family carrying considerably more data compared to the essential column
>>>>>>>> family.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We're doing some performance testing of the essential column family
>>>>>>>>> feature, and we're seeing some performance degradation when comparing
>>>>>>>>> with and without the feature enabled:
>>>>>>>>>
>>>>>>>>>                          Performance of scan relative
>>>>>>>>> % of rows selected       to not enabling the feature
>>>>>>>>> ---------------------    ----------------------------
>>>>>>>>> 100%                     1.0x
>>>>>>>>>  80%                     2.0x
>>>>>>>>>  60%                     2.3x
>>>>>>>>>  40%                     2.2x
>>>>>>>>>  20%                     1.5x
>>>>>>>>>  10%                     1.0x
>>>>>>>>>   5%                     0.67x
>>>>>>>>>   0%                     0.30x
>>>>>>>>>
>>>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>>>> essential column family is used in the filter, while the key value from
>>>>>>>>> the other, non essential column family is returned by the scan. Each row
>>>>>>>>> contains values for both key values, with the values being relatively
>>>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>>>>>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>>>>>>>
>>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> James
>>>>>>>>>
>>
>
