Using 30% selection rate, random distribution and FAST_DIFF encoding on both column families, I got:
2013-04-08 19:46:21,802 INFO [main] regionserver.TestJoinedScanners(166): Slow scanner finished in 5.251182 seconds, got 1547 rows
...
2013-04-08 19:46:26,661 INFO [main] regionserver.TestJoinedScanners(166): Joined scanner finished in 4.858834 seconds, got 1547 rows

2013-04-08 19:46:31,891 INFO [main] regionserver.TestJoinedScanners(166): Slow scanner finished in 5.22988 seconds, got 1547 rows
...
2013-04-08 19:46:36,566 INFO [main] regionserver.TestJoinedScanners(166): Joined scanner finished in 4.674822 seconds, got 1547 rows

Cheers

On Mon, Apr 8, 2013 at 6:53 PM, James Taylor <[email protected]> wrote:

> Good idea, Sergey. We'll rerun with larger non essential column family
> values and see if there's a crossover point. One other difference for us is
> that we're using FAST_DIFF encoding. We'll try with no encoding too. Our
> table has 20 million rows across four region servers.
>
> Regarding the parallelization we do, we run multiple scans in parallel
> instead of one single scan over the table. We use the region boundaries of
> the table to divide up the work evenly, adding a start/stop key for each
> scan that corresponds to the region boundaries. Our client then does a
> final merge/aggregation step (i.e. adding up the count it gets back from
> the scan for each region).
>
> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
>
>> IntegrationTestLazyCfLoading uses randomly distributed keys with the
>> following condition for filtering:
>> 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
>> is hex string of MD5 key.
>> Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
>> This test also showed significant improvement IIRC, so random distribution
>> and high percentage of values selected should not be a problem as such.
>>
>> My hunch would be that the additional cost of seeks/merging the results
>> from two CFs outweighs the benefit of lazy loading on such small values
>> for the "lazy" CF with lots of data selected.
>> This feature definitely makes
>> no sense if you are selecting all values, because then extra work is being
>> done for no benefit (everything is read anyway).
>> So the use cases would be larger "lazy" CFs and/or a low percentage of
>> values selected.
>>
>> Can you try to increase the 2nd CF values' size and rerun the test?
>>
>> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
>>
>>> In the TestJoinedScanners.java, is the 40% randomly distributed or
>>> sequential?
>>>
>>> In our test, the % is randomly distributed. Also, our custom filter does
>>> the same thing that SingleColumnValueFilter does. On the client-side, we'd
>>> execute the query in parallel, through multiple scans along the region
>>> boundaries. Would that have a negative impact on performance for this
>>> "essential column family" feature?
>>>
>>> Thanks,
>>>
>>> James
>>>
>>> On 04/08/2013 10:10 AM, Anoop John wrote:
>>>
>>>> Agree here. The effectiveness depends on what % of data satisfies the
>>>> condition and how it is distributed across HFile blocks. We will get a
>>>> performance gain when we are able to skip some HFile blocks (from
>>>> non essential CFs). Can you test with a different HFile block size
>>>> (lower value)?
>>>>
>>>> -Anoop-
>>>>
>>>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
>>>>
>>>>> I made the following change in TestJoinedScanners.java:
>>>>>
>>>>> - int flag_percent = 1;
>>>>> + int flag_percent = 40;
>>>>>
>>>>> The test took longer but still favors joined scanner.
>>>>> I got some new results:
>>>>>
>>>>> 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 7.424388 seconds, got 2050 rows
>>>>> ...
>>>>> 2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 5.05063 seconds, got 2050 rows
>>>>>
>>>>> 2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 6.348517 seconds, got 2050 rows
>>>>> ...
>>>>> 2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 4.587545 seconds, got 2050 rows
>>>>>
>>>>> Looks like effectiveness of joined scanner is affected by distribution of
>>>>> data.
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
>>>>>
>>>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
>>>>>> rows match, which would somewhat be in line with James' results.
>>>>>>
>>>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>>>>>>
>>>>>> -- Lars
>>>>>>
>>>>>> ________________________________
>>>>>> From: Ted Yu <[email protected]>
>>>>>> To: [email protected]
>>>>>> Sent: Sunday, April 7, 2013 4:13 PM
>>>>>> Subject: Re: Essential column family performance
>>>>>>
>>>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>>>>>> reference.
>>>>>>
>>>>>> On my MacBook, I got the following results from the test:
>>>>>>
>>>>>> 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 7.973822 seconds, got 100 rows
>>>>>> ...
>>>>>> 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 0.47235 seconds, got 100 rows
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>>>>>>
>>>>>>> Looking at
>>>>>>> <https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt>,
>>>>>>> I found that it didn't contain TestJoinedScanners which shows
>>>>>>> difference in scanner performance:
>>>>>>>
>>>>>>> LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>>>>>>>     Double.toString(timeSec)
>>>>>>>     + " seconds, got " + Long.toString(rows_count/2) + " rows");
>>>>>>>
>>>>>>> The test uses SingleColumnValueFilter:
>>>>>>>
>>>>>>> SingleColumnValueFilter filter = new SingleColumnValueFilter(
>>>>>>>     cf_essential, col_name, CompareFilter.CompareOp.EQUAL,
>>>>>>>     flag_yes);
>>>>>>>
>>>>>>> It is possible that the custom filter you were using would exhibit
>>>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
>>>>>>> your filter utilize hint ?
>>>>>>> It would be easier for me and other people to reproduce the issue you
>>>>>>> experienced if you put your scenario in some test similar to
>>>>>>> TestJoinedScanners.
>>>>>>>
>>>>>>> Will take a closer look at the code Monday.
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>>>>>>>
>>>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
>>>>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
>>>>>>>>
>>>>>>>> I can see that if the essential column family has more data compared to
>>>>>>>> the non essential column family that the results would eventually even
>>>>>>>> out. I was hoping to always be able to enable the essential column family
>>>>>>>> feature. Is there an inherent reason why performance would degrade like
>>>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> James
>>>>>>>>
>>>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>>>>>>>>
>>>>>>>>> James:
>>>>>>>>> Your test was based on 0.94.6.1, right ?
>>>>>>>>>
>>>>>>>>> What Filter were you using ?
>>>>>>>>>
>>>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>>>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>>>>>>>>>
>>>>>>>>> BTW the use case Max Lapan tried to address has non essential
>>>>>>>>> column family
>>>>>>>>> carrying considerably more data compared to essential column family.
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>>
>>>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello,
>>>>>>>>>>
>>>>>>>>>> We're doing some performance testing of the essential column family
>>>>>>>>>> feature, and we're seeing some performance degradation when comparing
>>>>>>>>>> with and without the feature enabled:
>>>>>>>>>>
>>>>>>>>>>                          Performance of scan relative
>>>>>>>>>> % of rows selected       to not enabling the feature
>>>>>>>>>> ---------------------    --------------------------------
>>>>>>>>>> 100%                     1.0x
>>>>>>>>>> 80%                      2.0x
>>>>>>>>>> 60%                      2.3x
>>>>>>>>>> 40%                      2.2x
>>>>>>>>>> 20%                      1.5x
>>>>>>>>>> 10%                      1.0x
>>>>>>>>>> 5%                       0.67x
>>>>>>>>>> 0%                       0.30x
>>>>>>>>>>
>>>>>>>>>> In our scenario, we have two column families. The key value from the
>>>>>>>>>> essential column family is used in the filter, while the key value from
>>>>>>>>>> the other, non essential column family is returned by the scan. Each row
>>>>>>>>>> contains values for both key values, with the values being relatively
>>>>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>>>>>>>>>> seeing a performance gain is when less than 10% of the rows are selected.
>>>>>>>>>>
>>>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> James
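
For reference, the row-selection condition Sergey quotes for IntegrationTestLazyCfLoading can be tried out without a cluster. The following is only a sketch: the class and method names (LazyCfFilterSketch, md5Hex, isSelected) and the "key-N" inputs are made up for illustration, and String.substring stands in for HBase's Bytes.toString(rowKey, 0, 4). It applies the same low-bit test to the first four hex characters of an MD5 row key, which should select roughly half of randomly distributed keys.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-alone sketch; not part of the HBase test itself.
class LazyCfFilterSketch {

    /** Returns the MD5 of the key as 32 lowercase hex characters. */
    static String md5Hex(String key) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(key.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /**
     * Same condition as quoted in the thread: parse the first four hex
     * characters of the row key and keep the row when the low bit is set.
     */
    static boolean isSelected(String rowKeyHex) {
        return 1 == (Long.parseLong(rowKeyHex.substring(0, 4), 16) & 1);
    }

    public static void main(String[] args) {
        int total = 10000;
        int selected = 0;
        for (int i = 0; i < total; i++) {
            // "key-" + i is an arbitrary, made-up input key.
            if (isSelected(md5Hex("key-" + i))) {
                selected++;
            }
        }
        System.out.println("selected " + selected + " of " + total + " rows");
    }
}
```

Because only the parity of the fourth hex digit matters, the selected rows are scattered across the key space rather than clustered, which is the "random distribution with a high percentage selected" case discussed in the thread.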

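
To compare the TestJoinedScanners runs quoted in this thread on one scale, the slow and joined timings can be turned into ratios. This is a small sketch, assuming only the message format of the LOG.info call quoted above; the class name ScannerTimingSketch and the run labels are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading the log lines quoted in the thread.
class ScannerTimingSketch {
    // Matches the message produced by the LOG.info call quoted in the thread.
    private static final Pattern LINE =
            Pattern.compile("(Slow|Joined) scanner finished in ([0-9.]+) seconds");

    /** Extracts the elapsed seconds from one log line of the given kind. */
    static double seconds(String line, String kind) {
        Matcher m = LINE.matcher(line);
        if (!m.find() || !m.group(1).equals(kind)) {
            throw new IllegalArgumentException("not a " + kind + " scanner line: " + line);
        }
        return Double.parseDouble(m.group(2));
    }

    /** Slow time divided by joined time; values above 1.0 favor the joined scanner. */
    static double speedup(String slowLine, String joinedLine) {
        return seconds(slowLine, "Slow") / seconds(joinedLine, "Joined");
    }

    public static void main(String[] args) {
        // Timings copied from the messages in this thread.
        String[][] runs = {
            {"1% match",
             "Slow scanner finished in 7.973822 seconds, got 100 rows",
             "Joined scanner finished in 0.47235 seconds, got 100 rows"},
            {"40% match, run 1",
             "Slow scanner finished in 7.424388 seconds, got 2050 rows",
             "Joined scanner finished in 5.05063 seconds, got 2050 rows"},
            {"40% match, run 2",
             "Slow scanner finished in 6.348517 seconds, got 2050 rows",
             "Joined scanner finished in 4.587545 seconds, got 2050 rows"},
            {"30% FAST_DIFF, run 1",
             "Slow scanner finished in 5.251182 seconds, got 1547 rows",
             "Joined scanner finished in 4.858834 seconds, got 1547 rows"},
            {"30% FAST_DIFF, run 2",
             "Slow scanner finished in 5.22988 seconds, got 1547 rows",
             "Joined scanner finished in 4.674822 seconds, got 1547 rows"},
        };
        for (String[] run : runs) {
            System.out.printf("%-22s %.2fx%n", run[0], speedup(run[1], run[2]));
        }
    }
}
```

On these numbers the joined scanner's advantage drops from roughly 17x at 1% match to about 1.4-1.5x at 40% and about 1.1x in the 30% FAST_DIFF runs. These are single-machine unit-test timings, so they are indicative only.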