I made the following change in TestJoinedScanners.java:

-    int flag_percent = 1;
+    int flag_percent = 40;

The test took longer but still favors joined scanner. I got some new results:

2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 7.424388 seconds, got 2050 rows
...
2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 5.05063 seconds, got 2050 rows
2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 6.348517 seconds, got 2050 rows
...
2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 4.587545 seconds, got 2050 rows

Looks like effectiveness of joined scanner is affected by distribution of data.

Cheers

On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:

> Looking at the joined scanner test code, it sets it up such that 1% of the
> rows match, which would somewhat be in line with James' results.
>
> In my own testing a while ago I found a 100% improvement with 0% match.
>
> -- Lars
>
> ________________________________
> From: Ted Yu <[email protected]>
> To: [email protected]
> Sent: Sunday, April 7, 2013 4:13 PM
> Subject: Re: Essential column family performance
>
> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
> reference.
>
> On my MacBook, I got the following results from the test:
>
> 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157):
> Slow scanner finished in 7.973822 seconds, got 100 rows
> ...
> 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157):
> Joined scanner finished in 0.47235 seconds, got 100 rows
>
> Cheers
>
> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>
> > Looking at
> > https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
> > I found that it didn't contain TestJoinedScanners which shows
> > difference in scanner performance:
> >
> > LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> >     Double.toString(timeSec)
> >     + " seconds, got " + Long.toString(rows_count/2) + " rows");
> >
> > The test uses SingleColumnValueFilter:
> >
> > SingleColumnValueFilter filter = new SingleColumnValueFilter(
> >     cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
> >
> > It is possible that the custom filter you were using would exhibit
> > different access pattern compared to SingleColumnValueFilter. e.g. does
> > your filter utilize hint ?
> > It would be easier for me and other people to reproduce the issue you
> > experienced if you put your scenario in some test similar to
> > TestJoinedScanners.
> >
> > Will take a closer look at the code Monday.
> >
> > Cheers
> >
> > On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
> >
> >> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
> >> filterIfMissing isn't the issue - the results of the scan are correct.
> >>
> >> I can see that if the essential column family has more data compared to
> >> the non essential column family that the results would eventually even out.
> >> I was hoping to always be able to enable the essential column family
> >> feature. Is there an inherent reason why performance would degrade like
> >> this? Does it boil down to a single sequential scan versus many seeks?
> >>
> >> Thanks,
> >>
> >> James
> >>
> >> On 04/07/2013 07:44 AM, Ted Yu wrote:
> >>
> >>> James:
> >>> Your test was based on 0.94.6.1, right ?
> >>>
> >>> What Filter were you using ?
> >>>
> >>> If you used SingleColumnValueFilter, have you seen my comment here ?
> >>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
> >>>
> >>> BTW the use case Max Lapan tried to address has non essential column
> >>> family carrying considerably more data compared to essential column family.
> >>>
> >>> Cheers
> >>>
> >>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
> >>>
> >>>> Hello,
> >>>> We're doing some performance testing of the essential column family
> >>>> feature, and we're seeing some performance degradation when comparing
> >>>> with and without the feature enabled:
> >>>>
> >>>>                          Performance of scan relative
> >>>> % of rows selected       to not enabling the feature
> >>>> ---------------------    --------------------------------
> >>>> 100%                     1.0x
> >>>> 80%                      2.0x
> >>>> 60%                      2.3x
> >>>> 40%                      2.2x
> >>>> 20%                      1.5x
> >>>> 10%                      1.0x
> >>>> 5%                       0.67x
> >>>> 0%                       0.30%
> >>>>
> >>>> In our scenario, we have two column families. The key value from the
> >>>> essential column family is used in the filter, while the key value from
> >>>> the other, non essential column family is returned by the scan. Each row
> >>>> contains values for both key values, with the values being relatively
> >>>> narrow (less than 50 bytes). In this scenario, the only time we're
> >>>> seeing a performance gain is when less than 10% of the rows are selected.
> >>>>
> >>>> Is this a reasonable test? Has anyone else measured this?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> James
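
For reference, the slow-to-joined ratios implied by the timings quoted in this thread can be worked out with a small standalone sketch. The class name ScannerSpeedup and the helper speedup() are made up for illustration and are not part of TestJoinedScanners; the numbers are copied from the log lines above, and the flag_percent labels assume the 2013-04-07 run used the default of 1.

```java
import java.util.Locale;

public class ScannerSpeedup {
    // Ratio of slow-scanner time to joined-scanner time.
    // A value above 1 means the joined scanner finished faster.
    static double speedup(double slowSeconds, double joinedSeconds) {
        if (joinedSeconds <= 0) {
            throw new IllegalArgumentException("joinedSeconds must be positive");
        }
        return slowSeconds / joinedSeconds;
    }

    public static void main(String[] args) {
        // {slow seconds, joined seconds}, copied from the quoted log output.
        double[][] runs = {
            {7.973822, 0.47235},   // assumed flag_percent = 1 (2013-04-07 run)
            {7.424388, 5.05063},   // flag_percent = 40, first pair
            {6.348517, 4.587545},  // flag_percent = 40, second pair
        };
        for (double[] run : runs) {
            System.out.printf(Locale.ROOT, "slow=%.3fs joined=%.3fs speedup=%.2fx%n",
                    run[0], run[1], speedup(run[0], run[1]));
        }
    }
}
```

This works out to roughly 17x when about 1% of rows match and about 1.4x to 1.5x at 40%. These are single runs on one laptop, so the ratios are only indicative.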
