I switched to a random distribution with 30% of the rows selected. I still saw a meaningful improvement from joined scanners:

2013-04-08 10:54:13,819 INFO [main] regionserver.TestJoinedScanners(158): Slow scanner finished in 6.20723 seconds, got 1552 rows
...
2013-04-08 10:54:18,801 INFO [main] regionserver.TestJoinedScanners(158): Joined scanner finished in 4.982732 seconds, got 1552 rows

2013-04-08 10:54:23,997 INFO [main] regionserver.TestJoinedScanners(158): Slow scanner finished in 5.195658 seconds, got 1552 rows
...
2013-04-08 10:54:28,619 INFO [main] regionserver.TestJoinedScanners(158): Joined scanner finished in 4.621337 seconds, got 1552 rows

Cheers

On Mon, Apr 8, 2013 at 10:42 AM, Ted Yu <[email protected]> wrote:

> bq. is the 40% randomly distributed or sequential?
>
> Looks like the distribution is striped:
>
>       if (i % 100 <= flag_percent) {
>         put.add(cf_essential, col_name, flag_yes);
>
> In each stripe, it is sequential.
>
> Let me try simulating random distribution.
>
> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
>
>> In the TestJoinedScanners.java, is the 40% randomly distributed or
>> sequential?
>>
>> In our test, the % is randomly distributed. Also, our custom filter does
>> the same thing that SingleColumnValueFilter does. On the client-side, we'd
>> execute the query in parallel, through multiple scans along the region
>> boundaries. Would that have a negative impact on performance for this
>> "essential column family" feature?
>>
>> Thanks,
>>
>> James
>>
>>
>> On 04/08/2013 10:10 AM, Anoop John wrote:
>>
>>> Agree here. The effectiveness depends on what % of data satisfies the
>>> condition and how it is distributed across HFile blocks. We will get a
>>> performance gain when we are able to skip some HFile blocks (from
>>> non essential CFs). Can you test with a different (lower) HFile block size?
>>>
>>> -Anoop-
>>>
>>>
>>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
>>>
>>>> I made the following change in TestJoinedScanners.java:
>>>>
>>>> -      int flag_percent = 1;
>>>> +      int flag_percent = 40;
>>>>
>>>> The test took longer but still favors joined scanner.
>>>> I got some new results:
>>>>
>>>> 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 7.424388 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>>
>>>> 2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 6.348517 seconds, got 2050 rows
>>>> ...
>>>> 2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>>
>>>> Looks like effectiveness of joined scanner is affected by distribution of
>>>> data.
>>>>
>>>> Cheers
>>>>
>>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
>>>>
>>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>>> rows match, which would somewhat be in line with James' results.
>>>>>
>>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>>
>>>>>
>>>>> -- Lars
>>>>>
>>>>>
>>>>> ________________________________
>>>>> From: Ted Yu <[email protected]>
>>>>> To: [email protected]
>>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>>> Subject: Re: Essential column family performance
>>>>>
>>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>>> reference.
>>>>>
>>>>> On my MacBook, I got the following results from the test:
>>>>>
>>>>> 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 7.973822 seconds, got 100 rows
>>>>> ...
>>>>> 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>>>>>
>>>>>> Looking at
>>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
>>>>>> I found that it didn't contain TestJoinedScanners which shows
>>>>>> difference in scanner performance:
>>>>>>
>>>>>>       LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>>           Double.toString(timeSec)
>>>>>>           + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>>
>>>>>> The test uses SingleColumnValueFilter:
>>>>>>
>>>>>>       SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>>           cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>>>>>>
>>>>>> It is possible that the custom filter you were using would exhibit
>>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
>>>>>> your filter utilize hint ?
>>>>>> It would be easier for me and other people to reproduce the issue you
>>>>>> experienced if you put your scenario in some test similar to
>>>>>> TestJoinedScanners.
>>>>>>
>>>>>> Will take a closer look at the code Monday.
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>>>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>>
>>>>>>> I can see that if the essential column family has more data compared to
>>>>>>> the non essential column family that the results would eventually even out.
>>>>>>> I was hoping to always be able to enable the essential column family
>>>>>>> feature. Is there an inherent reason why performance would degrade like
>>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> James
>>>>>>>
>>>>>>>
>>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>>
>>>>>>>> James:
>>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>>
>>>>>>>> What Filter were you using ?
>>>>>>>>
>>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>>
>>>>>>>> BTW the use case Max Lapan tried to address has non essential column
>>>>>>>> family carrying considerably more data compared to essential column family.
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> We're doing some performance testing of the essential column family
>>>>>>>>> feature, and we're seeing some performance degradation when comparing
>>>>>>>>> with and without the feature enabled:
>>>>>>>>>
>>>>>>>>>                          Performance of scan relative
>>>>>>>>> % of rows selected       to not enabling the feature
>>>>>>>>> ---------------------    ----------------------------
>>>>>>>>> 100%                     1.0x
>>>>>>>>> 80%                      2.0x
>>>>>>>>> 60%                      2.3x
>>>>>>>>> 40%                      2.2x
>>>>>>>>> 20%                      1.5x
>>>>>>>>> 10%                      1.0x
>>>>>>>>> 5%                       0.67x
>>>>>>>>> 0%                       0.30%
>>>>>>>>>
>>>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>>>> essential column family is used in the filter, while the key value from
>>>>>>>>> the other, non essential column family is returned by the scan. Each row
>>>>>>>>> contains values for both key values, with the values being relatively
>>>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>>>>>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>>>>>>>
>>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> James
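
For anyone following the thread, the point about distribution across HFile blocks can be illustrated with a toy model. This is only a sketch and not HBase code: the class name BlockSkipSketch, the fixed rows-per-block figure, and the per-row independent random draw are assumptions made for illustration (real HFile blocks are sized in bytes, and the joined scanner's behaviour depends on more than this). It counts how many blocks of the non essential column family contain no matching row at all, for a striped layout using the same `i % 100 <= flag_percent` condition as the test and for a random layout.

```java
import java.util.Random;

public class BlockSkipSketch {

    /** Counts blocks in which no row is flagged; in this toy model those blocks could be skipped. */
    public static int skippableBlocks(boolean[] flagged, int rowsPerBlock) {
        int skippable = 0;
        for (int start = 0; start < flagged.length; start += rowsPerBlock) {
            int end = Math.min(start + rowsPerBlock, flagged.length);
            boolean anyMatch = false;
            for (int i = start; i < end; i++) {
                if (flagged[i]) {
                    anyMatch = true;
                    break;
                }
            }
            if (!anyMatch) {
                skippable++;
            }
        }
        return skippable;
    }

    /** Striped layout, using the same condition as the test: i % 100 <= flagPercent. */
    public static boolean[] striped(int rows, int flagPercent) {
        boolean[] flagged = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            flagged[i] = (i % 100 <= flagPercent);
        }
        return flagged;
    }

    /** Random layout (assumption): each row is flagged independently with probability flagPercent/100. */
    public static boolean[] random(int rows, int flagPercent, long seed) {
        Random rng = new Random(seed);
        boolean[] flagged = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            flagged[i] = rng.nextInt(100) < flagPercent;
        }
        return flagged;
    }

    public static void main(String[] args) {
        int rows = 100_000;      // hypothetical table size
        int rowsPerBlock = 20;   // hypothetical number of rows that fit in one block
        int flagPercent = 30;

        int totalBlocks = (rows + rowsPerBlock - 1) / rowsPerBlock;
        System.out.println("total blocks:       " + totalBlocks);
        System.out.println("striped, skippable: "
                + skippableBlocks(striped(rows, flagPercent), rowsPerBlock));
        System.out.println("random, skippable:  "
                + skippableBlocks(random(rows, flagPercent, 42L), rowsPerBlock));
    }
}
```

In this model a striped layout leaves long runs of non-matching rows, so whole blocks can be skipped, while a random layout at the same percentage leaves very few blocks without a match. Lowering rowsPerBlock (the analogue of a smaller HFile block size) raises the number of skippable blocks in the random case, which is consistent with the suggestion to test with a lower block size.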
