Looking at the joined scanner test code, it sets it up such that 1% of the rows match, which would somewhat be in line with James' results.
In my own testing a while ago I found a 100% improvement with 0% match.

-- Lars

________________________________
From: Ted Yu <[email protected]>
To: [email protected]
Sent: Sunday, April 7, 2013 4:13 PM
Subject: Re: Essential column family performance

I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your reference.

On my MacBook, I got the following results from the test:

2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 7.973822 seconds, got 100 rows
...
2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 0.47235 seconds, got 100 rows

Cheers

On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:

> Looking at
> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
> I found that it didn't contain TestJoinedScanners which shows
> difference in scanner performance:
>
>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>         Double.toString(timeSec)
>         + " seconds, got " + Long.toString(rows_count/2) + " rows");
>
> The test uses SingleColumnValueFilter:
>
>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>
> It is possible that the custom filter you were using would exhibit
> different access pattern compared to SingleColumnValueFilter. e.g. does
> your filter utilize hint ?
> It would be easier for me and other people to reproduce the issue you
> experienced if you put your scenario in some test similar to
> TestJoinedScanners.
>
> Will take a closer look at the code Monday.
>
> Cheers
>
> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>
>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>> filterIfMissing isn't the issue - the results of the scan are correct.
>>
>> I can see that if the essential column family has more data compared to
>> the non essential column family that the results would eventually even out.
>> I was hoping to always be able to enable the essential column family
>> feature. Is there an inherent reason why performance would degrade like
>> this? Does it boil down to a single sequential scan versus many seeks?
>>
>> Thanks,
>>
>> James
>>
>>
>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>
>>> James:
>>> Your test was based on 0.94.6.1, right ?
>>>
>>> What Filter were you using ?
>>>
>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>
>>> BTW the use case Max Lapan tried to address has non essential column
>>> family carrying considerably more data compared to essential column family.
>>>
>>> Cheers
>>>
>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>>>
>>>> Hello,
>>>> We're doing some performance testing of the essential column family
>>>> feature, and we're seeing some performance degradation when comparing
>>>> with and without the feature enabled:
>>>>
>>>>                          Performance of scan relative
>>>> % of rows selected       to not enabling the feature
>>>> ---------------------    --------------------------------
>>>> 100%                     1.0x
>>>> 80%                      2.0x
>>>> 60%                      2.3x
>>>> 40%                      2.2x
>>>> 20%                      1.5x
>>>> 10%                      1.0x
>>>> 5%                       0.67x
>>>> 0%                       0.30%
>>>>
>>>> In our scenario, we have two column families. The key value from the
>>>> essential column family is used in the filter, while the key value from
>>>> the other, non essential column family is returned by the scan.
>>>> Each row contains values for both key values, with the values being
>>>> relatively narrow (less than 50 bytes). In this scenario, the only time
>>>> we're seeing a performance gain is when less than 10% of the rows are
>>>> selected.
>>>>
>>>> Is this a reasonable test? Has anyone else measured this?
>>>>
>>>> Thanks,
>>>>
>>>> James
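
James's question in the thread, whether the slowdown boils down to a single sequential scan versus many seeks, can be explored with a toy cost model. The sketch below is not HBase code and is not meant to reproduce the measured numbers. The class `ScanCostSketch`, its methods, and every cost value are hypothetical. It assumes that a plain scan pays a sequential read cost in both column families for every row, while a joined scan reads the essential family for every row and pays one extra seek plus one read in the other family for each row that passes the filter.

```java
// Toy cost model, NOT HBase code: every cost is a hypothetical unit per row.
class ScanCostSketch {

    /**
     * Time of a joined (load-on-demand) scan divided by the time of a plain scan.
     * Values below 1.0 mean the joined scan is faster in this model.
     *
     * @param selectivity   fraction of rows that pass the filter, 0.0 to 1.0
     * @param essentialCost cost of reading one row of the essential family
     * @param otherCost     cost of reading one row of the non essential family
     * @param seekCost      assumed extra cost of seeking into the non essential
     *                      family for one matching row
     */
    static double relativeTime(double selectivity, double essentialCost,
                               double otherCost, double seekCost) {
        // Plain scan: both families are read sequentially for every row.
        double plain = essentialCost + otherCost;
        // Joined scan: essential family always, other family only on a match.
        double joined = essentialCost + selectivity * (seekCost + otherCost);
        return joined / plain;
    }

    /** Selectivity at which both strategies cost the same in this model. */
    static double breakEven(double otherCost, double seekCost) {
        return otherCost / (seekCost + otherCost);
    }

    public static void main(String[] args) {
        // Equal-sized families and a seek nine times as expensive as a read:
        // both numbers are assumptions chosen only for illustration.
        double[] selectivities = {1.0, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.0};
        for (double p : selectivities) {
            System.out.printf("%5.0f%%  %.2fx%n",
                    p * 100, relativeTime(p, 1.0, 1.0, 9.0));
        }
        System.out.println("break-even selectivity: " + breakEven(1.0, 9.0));
    }
}
```

With equal read costs for the two families, the model gives a relative time of 0.5 when no rows match, whatever the seek cost, which points the same way as the 100% improvement with 0% match that Lars mentions. The model is monotone in the selectivity, so it cannot account for the 1.0x James reports at 100% of rows selected. Block size, caching, and the filter implementation are all ignored here, so a test along the lines of TestJoinedScanners remains the way to settle the question.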
