Looking at
https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
I found that it does not contain TestJoinedScanners, which shows the
difference in scanner performance:
LOG.info((slow ? "Slow" : "Joined") + " scanner finished in "
    + Double.toString(timeSec) + " seconds, got "
    + Long.toString(rows_count/2) + " rows");
The test uses SingleColumnValueFilter:
SingleColumnValueFilter filter = new SingleColumnValueFilter(
cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
It is possible that the custom filter you are using exhibits a
different access pattern from SingleColumnValueFilter. For example, does
your filter use hints?
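
For anyone following along, here is a toy illustration of what a hint-driven
access pattern looks like. In HBase a filter can return
ReturnCode.SEEK_NEXT_USING_HINT and supply the target through getNextKeyHint;
the sketch below only mimics that idea in plain Java over a sorted int array.
The class, the stride-based filter, and the counters are hypothetical and are
not HBase code.

```java
import java.util.Arrays;

/** Toy stand-in for a hint-driven scan; mirrors HBase concepts but is not HBase code. */
final class SeekHintSketch {
    enum ReturnCode { INCLUDE, SEEK_NEXT_USING_HINT }

    /** Hypothetical filter: only non-negative keys that are multiples of stride are wanted. */
    static ReturnCode filterKey(int key, int stride) {
        return key % stride == 0 ? ReturnCode.INCLUDE : ReturnCode.SEEK_NEXT_USING_HINT;
    }

    /** Smallest multiple of stride that is greater than key (key assumed non-negative). */
    static int nextKeyHint(int key, int stride) {
        return (key / stride + 1) * stride;
    }

    /** Walks the sorted keys and returns {included rows, seeks performed}. */
    static int[] scan(int[] sortedKeys, int stride) {
        int included = 0;
        int seeks = 0;
        int pos = 0;
        while (pos < sortedKeys.length) {
            int key = sortedKeys[pos];
            if (filterKey(key, stride) == ReturnCode.INCLUDE) {
                included++;
                pos++; // plain sequential next
            } else {
                // Jump to the first key >= hint instead of stepping row by row.
                int hint = nextKeyHint(key, stride);
                int idx = Arrays.binarySearch(sortedKeys, pos + 1, sortedKeys.length, hint);
                pos = idx >= 0 ? idx : -idx - 1;
                seeks++;
            }
        }
        return new int[] {included, seeks};
    }
}
```

A filter like this trades sequential reads for seeks, so its cost profile can
differ noticeably from a filter that inspects every row.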
It would be easier for me and other people to reproduce the issue you
experienced if you put your scenario in some test similar to
TestJoinedScanners.
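
To make the "single sequential scan versus many seeks" question concrete, here
is a toy cost model in plain Java. It is only a sketch: the class name, the two
cost constants, and the assumption that every selected row costs exactly one
extra seek are made up for illustration. Nothing here calls HBase, and it is
not meant to reproduce the measured numbers quoted below.

```java
/** Toy cost model only; the constants are assumptions, not measurements. */
final class JoinedScanCostSketch {
    static final double NEXT_COST = 1.0; // assumed cost of one sequential read
    static final double SEEK_COST = 3.0; // assumed cost of repositioning on a row

    /** Feature off: both column families are read sequentially for every row. */
    static double slowCost(long rows) {
        return 2 * NEXT_COST * rows;
    }

    /** Feature on: essential family read for every row, one seek per selected row. */
    static double joinedCost(long rows, long matches) {
        return NEXT_COST * rows + SEEK_COST * matches;
    }

    /** Ratio above 1.0 means the joined scan is modelled as more expensive. */
    static double relative(long rows, long matches) {
        return joinedCost(rows, matches) / slowCost(rows);
    }

    public static void main(String[] args) {
        long rows = 1_000_000;
        for (int pct : new int[] {100, 80, 60, 40, 20, 10, 5, 0}) {
            long matches = rows * pct / 100;
            System.out.printf("%3d%% selected: joined/slow cost = %.2f%n",
                    pct, relative(rows, matches));
        }
    }
}
```

Under these assumed constants the joined scan only wins when fewer than about
a third of the rows are selected; a real test along the lines of
TestJoinedScanners would be needed to see where the actual crossover lies.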
I will take a closer look at the code on Monday.
Cheers
On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
> filterIfMissing isn't the issue - the results of the scan are correct.
>
> I can see that if the essential column family has more data than the
> non essential column family, the results would eventually even out.
> I was hoping to always be able to enable the essential column family
> feature. Is there an inherent reason why performance would degrade like
> this? Does it boil down to a single sequential scan versus many seeks?
>
> Thanks,
>
> James
>
>
> On 04/07/2013 07:44 AM, Ted Yu wrote:
>
>> James:
>> Your test was based on 0.94.6.1, right?
>>
>> What filter were you using?
>>
>> If you used SingleColumnValueFilter, have you seen my comment here?
>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>
>> BTW the use case Max Lapan tried to address has the non essential column
>> family carrying considerably more data than the essential column family.
>>
>> Cheers
>>
>>
>>
>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>>
>>> Hello,
>>> We're doing some performance testing of the essential column family
>>> feature, and we're seeing some performance degradation when comparing
>>> with and without the feature enabled:
>>>
>>>                           Performance of scan relative
>>> % of rows selected        to not enabling the feature
>>> ---------------------     ----------------------------
>>>
>>> 100%                      1.0x
>>> 80%                       2.0x
>>> 60%                       2.3x
>>> 40%                       2.2x
>>> 20%                       1.5x
>>> 10%                       1.0x
>>> 5%                        0.67x
>>> 0%                        0.30x
>>>
>>> In our scenario, we have two column families. The key value from the
>>> essential column family is used in the filter, while the key value from
>>> the other, non essential column family is returned by the scan. Each row
>>> contains values for both key values, with the values being relatively
>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>
>>> Is this a reasonable test? Has anyone else measured this?
>>>
>>> Thanks,
>>>
>>> James
>>>