bq. with only 10000 rows that would all fit in the memstore.

This aspect should be enhanced in the test.
Cheers

On Tue, Apr 9, 2013 at 6:17 PM, Lars Hofhansl <[email protected]> wrote:

> Also the unit test runs with only 10000 rows that would all fit in the
> memstore. Seek vs reseek should make little difference for the memstore.
>
> We tested with 1m and 10m rows, and flushed the memstore and compacted
> the store.
>
> Will do some more verification later tonight.
>
> -- Lars
>
>
> Lars H <[email protected]> wrote:
>
> >Your slow scanner performance seems to vary as well. How come? Slow is
> >with the feature off.
> >
> >I don't see how reseek can be slower than seek in any scenario.
> >
> >-- Lars
> >
> >Ted Yu <[email protected]> wrote:
> >
> >>I tried using reseek() as suggested, along with my patch from HBASE-8306 (30%
> >>selection rate, random distribution and FAST_DIFF encoding on both column
> >>families).
> >>I got uneven results:
> >>
> >>2013-04-09 16:59:01,324 INFO [main] regionserver.TestJoinedScanners(167):
> >>Slow scanner finished in 7.529083 seconds, got 1546 rows
> >>
> >>2013-04-09 16:59:06,760 INFO [main] regionserver.TestJoinedScanners(167):
> >>Joined scanner finished in 5.43579 seconds, got 1546 rows
> >>...
> >>2013-04-09 16:59:12,711 INFO [main] regionserver.TestJoinedScanners(167):
> >>Slow scanner finished in 5.95016 seconds, got 1546 rows
> >>
> >>2013-04-09 16:59:20,240 INFO [main] regionserver.TestJoinedScanners(167):
> >>Joined scanner finished in 7.529044 seconds, got 1546 rows
> >>
> >>FYI
> >>
> >>On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <[email protected]> wrote:
> >>
> >>> We did some tests here.
> >>> I ran this through the profiler against a local RegionServer and found the
> >>> part that causes the slowdown is a seek called here:
> >>>   boolean mayHaveData =
> >>>     (nextJoinedKv != null &&
> >>>      nextJoinedKv.matchingRow(currentRow, offset, length))
> >>>     ||
> >>>     (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
> >>>      && joinedHeap.peek() != null
> >>>      && joinedHeap.peek().matchingRow(currentRow, offset, length));
> >>>
> >>> Looking at the code, this is needed because the joinedHeap can fall
> >>> behind, and hence we have to catch it up.
> >>> The key observation, though, is that the joined heap can only ever be
> >>> behind, and hence we do not need a seek, but only a reseek.
> >>>
> >>> Deploying a RegionServer with the seek replaced with reseek we see an
> >>> improvement in *all* cases.
> >>>
> >>> I'll file a jira with a fix later.
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>> From: James Taylor <[email protected]>
> >>> To: [email protected]
> >>> Sent: Monday, April 8, 2013 6:53 PM
> >>> Subject: Re: Essential column family performance
> >>>
> >>> Good idea, Sergey. We'll rerun with larger non essential column family
> >>> values and see if there's a crossover point. One other difference for us
> >>> is that we're using FAST_DIFF encoding. We'll try with no encoding too.
> >>> Our table has 20 million rows across four region servers.
> >>>
> >>> Regarding the parallelization we do, we run multiple scans in parallel
> >>> instead of one single scan over the table. We use the region boundaries
> >>> of the table to divide up the work evenly, adding a start/stop key for
> >>> each scan that corresponds to the region boundaries. Our client then
> >>> does a final merge/aggregation step (i.e. adding up the count it gets
> >>> back from the scan for each region).
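
[Editor's sketch] The client-side parallelization James describes above (one scan per region range, then a final sum of the per-range counts) can be pictured with a toy in-memory model. This is only an illustration under stated assumptions: `ParallelRangeCount`, `countRange` and `parallelCount` are hypothetical names, a `TreeMap` stands in for the table, the `flag == 1` check stands in for the filter, and no HBase client API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model (hypothetical, not HBase code): count matching rows per key range in parallel. */
final class ParallelRangeCount {

    /** Counts rows in [startKey, stopKey) whose flag passes the stand-in filter (flag == 1). */
    static long countRange(TreeMap<String, Integer> table, String startKey, String stopKey) {
        long count = 0;
        for (int flag : table.subMap(startKey, true, stopKey, false).values()) {
            if (flag == 1) {
                count++;
            }
        }
        return count;
    }

    /** boundaries holds n+1 sorted keys that delimit n ranges, playing the role of region boundaries. */
    static long parallelCount(TreeMap<String, Integer> table, List<String> boundaries) {
        int ranges = Math.max(1, boundaries.size() - 1);
        ExecutorService pool = Executors.newFixedThreadPool(ranges);
        try {
            List<Future<Long>> partials = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String stop = boundaries.get(i + 1);
                // One task per range, mirroring one scan per region.
                Callable<Long> task = () -> countRange(table, start, stop);
                partials.add(pool.submit(task));
            }
            // Final merge/aggregation step: add up the per-range counts.
            long total = 0;
            for (Future<Long> partial : partials) {
                total += partial.get();
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException("range scan failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> table = new TreeMap<>();
        for (int i = 0; i < 100; i++) {
            table.put(String.format("row%03d", i), i % 4 == 0 ? 1 : 0);
        }
        List<String> boundaries = Arrays.asList("row000", "row025", "row050", "row075", "row100");
        System.out.println(parallelCount(table, boundaries));
    }
}
```

Each range is read independently in this model, so it says nothing about how the essential column family feature behaves inside a single region scan; it only shows the shape of the client-side split and merge.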
> >>>
> >>> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
> >>> > IntegrationTestLazyCfLoading uses randomly distributed keys with the
> >>> > following condition for filtering:
> >>> > 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
> >>> > is the hex string of the MD5 key.
> >>> > Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
> >>> > This test also showed significant improvement IIRC, so random distribution
> >>> > and a high percentage of values selected should not be a problem as such.
> >>> >
> >>> > My hunch would be that the additional cost of seeks/merging the results
> >>> > from two CFs outweighs the benefit of lazy loading on such small values
> >>> > for the "lazy" CF with lots of data selected. This feature definitely makes
> >>> > no sense if you are selecting all values, because then extra work is being
> >>> > done for no benefit (everything is read anyway).
> >>> > So the use cases would be larger "lazy" CFs or/and low percentage of values
> >>> > selected.
> >>> >
> >>> > Can you try to increase the 2nd CF values' size and rerun the test?
> >>> >
> >>> >
> >>> > On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
> >>> >
> >>> >> In the TestJoinedScanners.java, is the 40% randomly distributed or
> >>> >> sequential?
> >>> >>
> >>> >> In our test, the % is randomly distributed. Also, our custom filter does
> >>> >> the same thing that SingleColumnValueFilter does. On the client-side, we'd
> >>> >> execute the query in parallel, through multiple scans along the region
> >>> >> boundaries. Would that have a negative impact on performance for this
> >>> >> "essential column family" feature?
> >>> >>
> >>> >> Thanks,
> >>> >>
> >>> >> James
> >>> >>
> >>> >>
> >>> >> On 04/08/2013 10:10 AM, Anoop John wrote:
> >>> >>
> >>> >>> Agree here.
> >>> >>> The effectiveness depends on what % of data satisfies the
> >>> >>> condition, and how it is distributed across HFile blocks. We will get a
> >>> >>> performance gain when we are able to skip some HFile blocks (from
> >>> >>> non essential CFs). Can you test with a different HFile block size (lower
> >>> >>> value)?
> >>> >>>
> >>> >>> -Anoop-
> >>> >>>
> >>> >>>
> >>> >>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
> >>> >>>
> >>> >>>> I made the following change in TestJoinedScanners.java:
> >>> >>>> - int flag_percent = 1;
> >>> >>>> + int flag_percent = 40;
> >>> >>>>
> >>> >>>> The test took longer but still favors joined scanner.
> >>> >>>> I got some new results:
> >>> >>>>
> >>> >>>> 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157):
> >>> >>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
> >>> >>>> ...
> >>> >>>> 2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157):
> >>> >>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
> >>> >>>>
> >>> >>>> 2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157):
> >>> >>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
> >>> >>>> ...
> >>> >>>> 2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157):
> >>> >>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
> >>> >>>>
> >>> >>>> Looks like effectiveness of joined scanner is affected by distribution of
> >>> >>>> data.
> >>> >>>>
> >>> >>>> Cheers
> >>> >>>>
> >>> >>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
> >>> >>>>
> >>> >>>>> Looking at the joined scanner test code, it sets it up such that 1% of the
> >>> >>>>> rows match, which would somewhat be in line with James' results.
> >>> >>>>>
> >>> >>>>> In my own testing a while ago I found a 100% improvement with 0% match.
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> -- Lars
> >>> >>>>>
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> ________________________________
> >>> >>>>> From: Ted Yu <[email protected]>
> >>> >>>>> To: [email protected]
> >>> >>>>> Sent: Sunday, April 7, 2013 4:13 PM
> >>> >>>>> Subject: Re: Essential column family performance
> >>> >>>>>
> >>> >>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
> >>> >>>>> reference.
> >>> >>>>>
> >>> >>>>> On my MacBook, I got the following results from the test:
> >>> >>>>>
> >>> >>>>> 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157):
> >>> >>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
> >>> >>>>> ...
> >>> >>>>> 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157):
> >>> >>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
> >>> >>>>>
> >>> >>>>> Cheers
> >>> >>>>>
> >>> >>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
> >>> >>>>>
> >>> >>>>>> Looking at
> >>> >>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
> >>> >>>>>> I found that it didn't contain TestJoinedScanners which shows
> >>> >>>>>> difference in scanner performance:
> >>> >>>>>>
> >>> >>>>>>   LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> >>> >>>>>>       Double.toString(timeSec)
> >>> >>>>>>       + " seconds, got " + Long.toString(rows_count/2) + " rows");
> >>> >>>>>>
> >>> >>>>>> The test uses SingleColumnValueFilter:
> >>> >>>>>>
> >>> >>>>>>   SingleColumnValueFilter filter = new SingleColumnValueFilter(
> >>> >>>>>>       cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
> >>> >>>>>>
> >>> >>>>>> It is possible that the custom filter you were using would exhibit
> >>> >>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
> >>> >>>>>> your filter utilize hint ?
> >>> >>>>>> It would be easier for me and other people to reproduce the issue you
> >>> >>>>>> experienced if you put your scenario in some test similar to
> >>> >>>>>> TestJoinedScanners.
> >>> >>>>>>
> >>> >>>>>> Will take a closer look at the code Monday.
> >>> >>>>>>
> >>> >>>>>> Cheers
> >>> >>>>>>
> >>> >>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
> >>> >>>>>>
> >>> >>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
> >>> >>>>>>> filterIfMissing isn't the issue - the results of the scan are correct.
> >>> >>>>>>>
> >>> >>>>>>> I can see that if the essential column family has more data compared to
> >>> >>>>>>> the non essential column family that the results would eventually even out.
> >>> >>>>>>> I was hoping to always be able to enable the essential column family
> >>> >>>>>>> feature. Is there an inherent reason why performance would degrade like
> >>> >>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
> >>> >>>>>>>
> >>> >>>>>>> Thanks,
> >>> >>>>>>>
> >>> >>>>>>> James
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
> >>> >>>>>>>
> >>> >>>>>>>> James:
> >>> >>>>>>>> Your test was based on 0.94.6.1, right ?
> >>> >>>>>>>>
> >>> >>>>>>>> What Filter were you using ?
> >>> >>>>>>>>
> >>> >>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
> >>> >>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
> >>> >>>>>>>>
> >>> >>>>>>>> BTW the use case Max Lapan tried to address has non essential column
> >>> >>>>>>>> family
> >>> >>>>>>>> carrying considerably more data compared to essential column family.
> >>> >>>>>>>>
> >>> >>>>>>>> Cheers
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
> >>> >>>>>>>>
> >>> >>>>>>>>> Hello,
> >>> >>>>>>>>>
> >>> >>>>>>>>> We're doing some performance testing of the essential column family
> >>> >>>>>>>>> feature, and we're seeing some performance degradation when comparing with
> >>> >>>>>>>>> and without the feature enabled:
> >>> >>>>>>>>>
> >>> >>>>>>>>>                          Performance of scan relative
> >>> >>>>>>>>> % of rows selected       to not enabling the feature
> >>> >>>>>>>>> ---------------------    ----------------------------
> >>> >>>>>>>>> 100%                     1.0x
> >>> >>>>>>>>> 80%                      2.0x
> >>> >>>>>>>>> 60%                      2.3x
> >>> >>>>>>>>> 40%                      2.2x
> >>> >>>>>>>>> 20%                      1.5x
> >>> >>>>>>>>> 10%                      1.0x
> >>> >>>>>>>>> 5%                       0.67x
> >>> >>>>>>>>> 0%                       0.30x
> >>> >>>>>>>>>
> >>> >>>>>>>>> In our scenario, we have two column families. The key value from the
> >>> >>>>>>>>> essential column family is used in the filter, while the key value from the
> >>> >>>>>>>>> other, non essential column family is returned by the scan. Each row
> >>> >>>>>>>>> contains values for both key values, with the values being relatively
> >>> >>>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're seeing a
> >>> >>>>>>>>> performance gain is when less than 10% of the rows are selected.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Is this a reasonable test? Has anyone else measured this?
> >>> >>>>>>>>>
> >>> >>>>>>>>> Thanks,
> >>> >>>>>>>>>
> >>> >>>>>>>>> James
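
[Editor's sketch] Lars's observation in this thread is that the joined heap can only ever be behind the current row, so a forward-only reseek is enough to catch it up. The toy cursor below illustrates why that substitution is safe: when every target is at or ahead of the cursor, `seek` and `reseek` land on the same element. Everything here is hypothetical. `SortedCursor` is not HBase's KeyValueHeap, and the cost model (seek restarts from the first element, reseek continues from the current position) is an assumption chosen for illustration, a crude stand-in for whatever work a real seek repeats.

```java
/** Toy cursor over a sorted int array (hypothetical, not HBase's KeyValueHeap). */
final class SortedCursor {
    private final int[] keys;   // sorted ascending
    private int pos = 0;        // index of the current element
    long steps = 0;             // elements skipped so far, a toy cost measure

    SortedCursor(int[] sortedKeys) {
        this.keys = sortedKeys;
    }

    /** General seek: assumes nothing about the current position and starts over. */
    boolean seek(int target) {
        pos = 0;
        return advanceTo(target);
    }

    /** Forward-only reseek: correct only when the target is not behind the cursor. */
    boolean reseek(int target) {
        return advanceTo(target);
    }

    private boolean advanceTo(int target) {
        while (pos < keys.length && keys[pos] < target) {
            pos++;
            steps++;
        }
        return pos < keys.length;
    }

    /** Current element, or null when the cursor is exhausted. */
    Integer peek() {
        return pos < keys.length ? keys[pos] : null;
    }

    public static void main(String[] args) {
        int[] keys = {1, 3, 5, 7, 9, 11};
        SortedCursor viaSeek = new SortedCursor(keys);
        SortedCursor viaReseek = new SortedCursor(keys);
        // Targets only ever move forward, like rows in a forward scan.
        for (int target : new int[] {5, 8, 11}) {
            viaSeek.seek(target);
            viaReseek.reseek(target);
            System.out.println(target + " -> " + viaSeek.peek() + " / " + viaReseek.peek());
        }
        System.out.println("seek steps=" + viaSeek.steps + ", reseek steps=" + viaReseek.steps);
    }
}
```

If a target could fall behind the cursor, `reseek` in this model would silently stay where it is, which is why the "can only ever be behind" invariant matters before swapping one call for the other.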
