bq. through multiple scans along the region boundaries

Sorry, I am not able to follow what you are saying here. Could you elaborate on this? I think the validity of the essential CF feature is best tested in real use cases such as the one in Phoenix.

Regards
Ram

On Mon, Apr 8, 2013 at 11:12 PM, Ted Yu <[email protected]> wrote:

> bq. is the 40% randomly distributed or sequential?
>
> Looks like the distribution is striped:
>
>   if (i % 100 <= flag_percent) {
>     put.add(cf_essential, col_name, flag_yes);
>
> In each stripe, it is sequential.
>
> Let me try simulating random distribution.
>
> On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
>
> > In the TestJoinedScanners.java, is the 40% randomly distributed or
> > sequential?
> >
> > In our test, the % is randomly distributed. Also, our custom filter does
> > the same thing that SingleColumnValueFilter does. On the client-side, we'd
> > execute the query in parallel, through multiple scans along the region
> > boundaries. Would that have a negative impact on performance for this
> > "essential column family" feature?
> >
> > Thanks,
> >
> > James
> >
> > On 04/08/2013 10:10 AM, Anoop John wrote:
> >
> >> Agree here. The effectiveness depends on what % of data satisfies the
> >> condition, how it is distributed across HFile blocks. We will get
> >> performance gain when the we will be able to skip some HFile blocks (from
> >> non essential CFs). Can test with different HFile block size (lower
> >> value)?
> >>
> >> -Anoop-
> >>
> >> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
> >>
> >>> I made the following change in TestJoinedScanners.java:
> >>>
> >>> - int flag_percent = 1;
> >>> + int flag_percent = 40;
> >>>
> >>> The test took longer but still favors joined scanner.
> >>> I got some new results:
> >>>
> >>> 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 7.424388 seconds, got 2050 rows
> >>> ...
> >>> 2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 5.05063 seconds, got 2050 rows
> >>>
> >>> 2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 6.348517 seconds, got 2050 rows
> >>> ...
> >>> 2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 4.587545 seconds, got 2050 rows
> >>>
> >>> Looks like effectiveness of joined scanner is affected by distribution of
> >>> data.
> >>>
> >>> Cheers
> >>>
> >>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
> >>>
> >>>> Looking at the joined scanner test code, it sets it up such that 1% of
> >>>> the rows match, which would somewhat be in line with James' results.
> >>>>
> >>>> In my own testing a while ago I found a 100% improvement with 0% match.
> >>>>
> >>>> -- Lars
> >>>>
> >>>> ________________________________
> >>>> From: Ted Yu <[email protected]>
> >>>> To: [email protected]
> >>>> Sent: Sunday, April 7, 2013 4:13 PM
> >>>> Subject: Re: Essential column family performance
> >>>>
> >>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
> >>>> reference.
> >>>>
> >>>> On my MacBook, I got the following results from the test:
> >>>>
> >>>> 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157): Slow scanner finished in 7.973822 seconds, got 100 rows
> >>>> ...
> >>>> 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157): Joined scanner finished in 0.47235 seconds, got 100 rows
> >>>>
> >>>> Cheers
> >>>>
> >>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
> >>>>
> >>>>> Looking at
> >>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
> >>>>> I found that it didn't contain TestJoinedScanners which shows
> >>>>> difference in scanner performance:
> >>>>>
> >>>>>   LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> >>>>>     Double.toString(timeSec)
> >>>>>     + " seconds, got " + Long.toString(rows_count/2) + " rows");
> >>>>>
> >>>>> The test uses SingleColumnValueFilter:
> >>>>>
> >>>>>   SingleColumnValueFilter filter = new SingleColumnValueFilter(
> >>>>>     cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
> >>>>>
> >>>>> It is possible that the custom filter you were using would exhibit
> >>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
> >>>>> your filter utilize hint ?
> >>>>> It would be easier for me and other people to reproduce the issue you
> >>>>> experienced if you put your scenario in some test similar to
> >>>>> TestJoinedScanners.
> >>>>>
> >>>>> Will take a closer look at the code Monday.
> >>>>>
> >>>>> Cheers
> >>>>>
> >>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
> >>>>>
> >>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
> >>>>>> so filterIfMissing isn't the issue - the results of the scan are correct.
> >>>>>>
> >>>>>> I can see that if the essential column family has more data compared to
> >>>>>> the non essential column family that the results would eventually even
> >>>>>> out.
> >>>>>> I was hoping to always be able to enable the essential column family
> >>>>>> feature. Is there an inherent reason why performance would degrade like
> >>>>>> this? Does it boil down to a single sequential scan versus many seeks?
> >>>>>>
> >>>>>> Thanks,
> >>>>>>
> >>>>>> James
> >>>>>>
> >>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
> >>>>>>
> >>>>>>> James:
> >>>>>>> Your test was based on 0.94.6.1, right ?
> >>>>>>>
> >>>>>>> What Filter were you using ?
> >>>>>>>
> >>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
> >>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
> >>>>>>>
> >>>>>>> BTW the use case Max Lapan tried to address has non essential column
> >>>>>>> family carrying considerably more data compared to essential column
> >>>>>>> family.
> >>>>>>>
> >>>>>>> Cheers
> >>>>>>>
> >>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
> >>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> We're doing some performance testing of the essential column family
> >>>>>>>> feature, and we're seeing some performance degradation when comparing
> >>>>>>>> with and without the feature enabled:
> >>>>>>>>
> >>>>>>>>                          Performance of scan relative
> >>>>>>>> % of rows selected       to not enabling the feature
> >>>>>>>> ---------------------    --------------------------------
> >>>>>>>> 100%                     1.0x
> >>>>>>>> 80%                      2.0x
> >>>>>>>> 60%                      2.3x
> >>>>>>>> 40%                      2.2x
> >>>>>>>> 20%                      1.5x
> >>>>>>>> 10%                      1.0x
> >>>>>>>> 5%                       0.67x
> >>>>>>>> 0%                       0.30%
> >>>>>>>>
> >>>>>>>> In our scenario, we have two column families. The key value from the
> >>>>>>>> essential column family is used in the filter, while the key value
> >>>>>>>> from the other, non essential column family is returned by the scan.
> >>>>>>>> Each row contains values for both key values, with the values being
> >>>>>>>> relatively narrow (less than 50 bytes). In this scenario, the only
> >>>>>>>> time we're seeing a performance gain is when less than 10% of the rows
> >>>>>>>> are selected.
> >>>>>>>>
> >>>>>>>> Is this a reasonable test? Has anyone else measured this?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>>
> >>>>>>>> James
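
As a rough illustration of the point made in the thread that the gain depends on how matching rows are distributed across HFile blocks, here is a minimal standalone sketch. It does not use any HBase API. It assumes a toy model in which a "block" is simply a fixed number of consecutive rows of the non essential CF, and a block can be skipped only if none of its rows match the filter. The class name BlockSkipSketch, its methods, and the numbers (10000 rows, 10 rows per block, seed 42) are hypothetical and chosen only for illustration. The striped predicate mirrors the `i % 100 <= flag_percent` condition quoted from TestJoinedScanners.

```java
import java.util.Random;
import java.util.function.IntPredicate;

// Toy model only: a "block" is rowsPerBlock consecutive rows. This is an
// assumption for illustration, not how HFile blocks are actually sized.
public class BlockSkipSketch {

    // Counts blocks in which no row matches, i.e. blocks that could be skipped.
    public static int skippableBlocks(int totalRows, int rowsPerBlock, IntPredicate matches) {
        int skippable = 0;
        for (int start = 0; start < totalRows; start += rowsPerBlock) {
            int end = Math.min(start + rowsPerBlock, totalRows);
            boolean anyMatch = false;
            for (int row = start; row < end; row++) {
                if (matches.test(row)) {
                    anyMatch = true;
                    break;
                }
            }
            if (!anyMatch) {
                skippable++;
            }
        }
        return skippable;
    }

    // Striped layout, same condition as the snippet quoted from the test.
    public static IntPredicate striped(int flagPercent) {
        return row -> row % 100 <= flagPercent;
    }

    // Random layout with the same per-row match probability, fixed up front
    // so that each row gives a stable answer.
    public static IntPredicate random(int totalRows, int flagPercent, long seed) {
        Random rng = new Random(seed);
        boolean[] flags = new boolean[totalRows];
        for (int i = 0; i < totalRows; i++) {
            flags[i] = rng.nextInt(100) <= flagPercent;
        }
        return row -> flags[row];
    }

    public static void main(String[] args) {
        int totalRows = 10000;
        int rowsPerBlock = 10;
        int flagPercent = 40;
        System.out.println("striped: "
            + skippableBlocks(totalRows, rowsPerBlock, striped(flagPercent)));
        System.out.println("random:  "
            + skippableBlocks(totalRows, rowsPerBlock, random(totalRows, flagPercent, 42L)));
    }
}
```

Under these assumed numbers the striped layout leaves 500 of the 1000 toy blocks skippable, while with a random layout a block is skippable only with probability 0.59^10 (about 0.5%). That would be consistent with the joined scanner looking better in the striped test than in a randomly distributed data set, but it is only a model and not a measurement.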
