Your slow scanner performance seems to vary as well. How come? Slow is with the feature off.
I don't see how reseek can be slower than seek in any scenario.

-- Lars

Ted Yu <[email protected]> wrote:

>I tried using reseek() as suggested, along with my patch from HBASE-8306 (30%
>selection rate, random distribution and FAST_DIFF encoding on both column
>families).
>I got uneven results:
>
>2013-04-09 16:59:01,324 INFO [main] regionserver.TestJoinedScanners(167):
>Slow scanner finished in 7.529083 seconds, got 1546 rows
>
>2013-04-09 16:59:06,760 INFO [main] regionserver.TestJoinedScanners(167):
>Joined scanner finished in 5.43579 seconds, got 1546 rows
>...
>2013-04-09 16:59:12,711 INFO [main] regionserver.TestJoinedScanners(167):
>Slow scanner finished in 5.95016 seconds, got 1546 rows
>
>2013-04-09 16:59:20,240 INFO [main] regionserver.TestJoinedScanners(167):
>Joined scanner finished in 7.529044 seconds, got 1546 rows
>
>FYI
>
>On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <[email protected]> wrote:
>
>> We did some tests here.
>> I ran this through the profiler against a local RegionServer and found the
>> part that causes the slowdown is a seek called here:
>>
>>     boolean mayHaveData =
>>         (nextJoinedKv != null &&
>>          nextJoinedKv.matchingRow(currentRow, offset, length))
>>         ||
>>         (this.joinedHeap.seek(KeyValue.createFirstOnRow(currentRow, offset, length))
>>          && joinedHeap.peek() != null
>>          && joinedHeap.peek().matchingRow(currentRow, offset, length));
>>
>> Looking at the code, this is needed because the joinedHeap can fall
>> behind, and hence we have to catch it up.
>> The key observation, though, is that the joined heap can only ever be
>> behind, and hence we do not need a seek, but only a reseek.
>>
>> Deploying a RegionServer with the seek replaced with reseek we see an
>> improvement in *all* cases.
>>
>> I'll file a jira with a fix later.
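To make the seek vs. reseek distinction concrete, here is a toy model in plain Java. It is not HBase code: the class `ReseekSketch`, its fields, and the comparison counter are all hypothetical, and restarting from position 0 merely stands in for whatever repositioning cost a full seek pays. The only point it illustrates is the one made above: when the targets only ever move forward, a reseek that continues from the current position does no more work than a seek that starts over.

```java
/** Toy model of a forward-only scanner over sorted row keys; NOT HBase code. */
final class ReseekSketch {
    private final String[] rows;   // sorted row keys (assumed sorted by the caller)
    private int pos = 0;           // current scanner position
    long comparisons = 0;          // crude measure of work done

    ReseekSketch(String[] sortedRows) {
        this.rows = sortedRows;
    }

    /** General seek: discards the current position and searches from the start. */
    boolean seek(String row) {
        pos = 0;
        return advanceTo(row);
    }

    /** Reseek: only valid when the target is at or after the current position. */
    boolean reseek(String row) {
        return advanceTo(row);
    }

    /** Moves forward until the current row is >= the target row. */
    private boolean advanceTo(String row) {
        while (pos < rows.length) {
            comparisons++;
            if (rows[pos].compareTo(row) >= 0) {
                break;
            }
            pos++;
        }
        return pos < rows.length;
    }

    /** Returns the row at the current position, or null if exhausted. */
    String peek() {
        return pos < rows.length ? rows[pos] : null;
    }

    public static void main(String[] args) {
        String[] rows = {"r1", "r2", "r3", "r4", "r5", "r6"};
        String[] targets = {"r2", "r4", "r6"};   // strictly increasing, like a scan
        ReseekSketch viaSeek = new ReseekSketch(rows);
        ReseekSketch viaReseek = new ReseekSketch(rows);
        for (String t : targets) {
            viaSeek.seek(t);
            viaReseek.reseek(t);
        }
        System.out.println("seek comparisons:   " + viaSeek.comparisons);
        System.out.println("reseek comparisons: " + viaReseek.comparisons);
    }
}
```

Both scanners end up on the same row after every call; only the amount of work differs. Real scanners use block indexes rather than a linear walk, so the absolute numbers here mean nothing.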
>>
>> -- Lars
>>
>>
>> ________________________________
>> From: James Taylor <[email protected]>
>> To: [email protected]
>> Sent: Monday, April 8, 2013 6:53 PM
>> Subject: Re: Essential column family performance
>>
>> Good idea, Sergey. We'll rerun with larger non essential column family
>> values and see if there's a crossover point. One other difference for us
>> is that we're using FAST_DIFF encoding. We'll try with no encoding too.
>> Our table has 20 million rows across four region servers.
>>
>> Regarding the parallelization we do, we run multiple scans in parallel
>> instead of one single scan over the table. We use the region boundaries
>> of the table to divide up the work evenly, adding a start/stop key for
>> each scan that corresponds to the region boundaries. Our client then
>> does a final merge/aggregation step (i.e. adding up the count it gets
>> back from the scan for each region).
>>
>> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
>> > IntegrationTestLazyCfLoading uses randomly distributed keys with the
>> > following condition for filtering:
>> > 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where rowKey
>> > is hex string of MD5 key.
>> > Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
>> > This test also showed significant improvement IIRC, so random distribution
>> > and high %age of values selected should not be a problem as such.
>> >
>> > My hunch would be that the additional cost of seeks/merging the results
>> > from two CFs outweighs the benefit of lazy loading on such small values
>> > for the "lazy" CF with lots of data selected. This feature definitely makes
>> > no sense if you are selecting all values, because then extra work is being
>> > done for no benefit (everything is read anyway).
>> > So the use cases would be larger "lazy" CFs or/and low percentage of values
>> > selected.
>> >
>> > Can you try to increase the 2nd CF values' size and rerun the test?
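The client-side parallelization James describes (one scan per region, bounded by start/stop keys, followed by summing the per-region counts) can be sketched without any HBase dependency. Everything below is an assumption made for illustration: a `TreeMap` stands in for the table, the `splitKeys` list stands in for region boundaries, the value `1` stands in for "row passes the filter", and the names `RegionParallelCount`, `countRange` and `parallelCount` are hypothetical. It shows only the shape of the divide-and-merge step, not the actual client code used in the test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch of "one scan per region, then add up the counts"; NOT HBase code. */
final class RegionParallelCount {

    /** Counts rows flagged 1 in [start, stop); a null bound means unbounded. */
    static long countRange(NavigableMap<String, Integer> table, String start, String stop) {
        NavigableMap<String, Integer> view = table;
        if (start != null) {
            view = view.tailMap(start, true);    // start key is inclusive
        }
        if (stop != null) {
            view = view.headMap(stop, false);    // stop key is exclusive
        }
        long matches = 0;
        for (int flag : view.values()) {
            if (flag == 1) {
                matches++;
            }
        }
        return matches;
    }

    /** Runs one range count per "region" in parallel and sums the results. */
    static long parallelCount(NavigableMap<String, Integer> table, List<String> splitKeys) {
        ExecutorService pool = Executors.newFixedThreadPool(splitKeys.size() + 1);
        try {
            List<Future<Long>> parts = new ArrayList<>();
            String start = null;
            for (int i = 0; i <= splitKeys.size(); i++) {
                final String from = start;
                final String to = i < splitKeys.size() ? splitKeys.get(i) : null;
                Callable<Long> task = () -> countRange(table, from, to);
                parts.add(pool.submit(task));
                start = to;
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get();             // final merge step: add up the counts
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("range count failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Integer> table = new TreeMap<>();
        table.put("a", 1);
        table.put("b", 0);
        table.put("c", 1);
        table.put("d", 1);
        table.put("e", 0);
        table.put("f", 1);
        // Two split keys give three "regions": [-inf, c), [c, e), [e, +inf)
        System.out.println(parallelCount(table, Arrays.asList("c", "e")));
    }
}
```

Because each range is independent, the per-region work here is the same whether the ranges run in parallel or one after another; the sketch says nothing about how parallel scans interact with the essential column family feature on the server side.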
>> >
>> >
>> > On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]> wrote:
>> >
>> >> In the TestJoinedScanners.java, is the 40% randomly distributed or
>> >> sequential?
>> >>
>> >> In our test, the % is randomly distributed. Also, our custom filter does
>> >> the same thing that SingleColumnValueFilter does. On the client-side, we'd
>> >> execute the query in parallel, through multiple scans along the region
>> >> boundaries. Would that have a negative impact on performance for this
>> >> "essential column family" feature?
>> >>
>> >> Thanks,
>> >>
>> >> James
>> >>
>> >>
>> >> On 04/08/2013 10:10 AM, Anoop John wrote:
>> >>
>> >>> Agree here. The effectiveness depends on what % of data satisfies the
>> >>> condition and how it is distributed across HFile blocks. We will get a
>> >>> performance gain when we are able to skip some HFile blocks (from
>> >>> non essential CFs). Can you test with a different HFile block size (lower
>> >>> value)?
>> >>>
>> >>> -Anoop-
>> >>>
>> >>>
>> >>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
>> >>>
>> >>>> I made the following change in TestJoinedScanners.java:
>> >>>> - int flag_percent = 1;
>> >>>> + int flag_percent = 40;
>> >>>>
>> >>>> The test took longer but still favors joined scanner.
>> >>>> I got some new results:
>> >>>>
>> >>>> 2013-04-08 07:46:06,959 INFO [main] regionserver.TestJoinedScanners(157):
>> >>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
>> >>>> ...
>> >>>> 2013-04-08 07:46:12,010 INFO [main] regionserver.TestJoinedScanners(157):
>> >>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
>> >>>>
>> >>>> 2013-04-08 07:46:18,358 INFO [main] regionserver.TestJoinedScanners(157):
>> >>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
>> >>>> ...
>> >>>> 2013-04-08 07:46:22,946 INFO [main] regionserver.TestJoinedScanners(157):
>> >>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
>> >>>>
>> >>>> Looks like effectiveness of joined scanner is affected by distribution of
>> >>>> data.
>> >>>>
>> >>>> Cheers
>> >>>>
>> >>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
>> >>>>
>> >>>>> Looking at the joined scanner test code, it sets it up such that 1% of
>> >>>>> the rows match, which would somewhat be in line with James' results.
>> >>>>>
>> >>>>> In my own testing a while ago I found a 100% improvement with 0% match.
>> >>>>>
>> >>>>>
>> >>>>> -- Lars
>> >>>>>
>> >>>>>
>> >>>>> ________________________________
>> >>>>> From: Ted Yu <[email protected]>
>> >>>>> To: [email protected]
>> >>>>> Sent: Sunday, April 7, 2013 4:13 PM
>> >>>>> Subject: Re: Essential column family performance
>> >>>>>
>> >>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
>> >>>>> reference.
>> >>>>>
>> >>>>> On my MacBook, I got the following results from the test:
>> >>>>>
>> >>>>> 2013-04-07 16:08:17,474 INFO [main] regionserver.TestJoinedScanners(157):
>> >>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
>> >>>>> ...
>> >>>>> 2013-04-07 16:08:17,946 INFO [main] regionserver.TestJoinedScanners(157):
>> >>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
>> >>>>>
>> >>>>> Cheers
>> >>>>>
>> >>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
>> >>>>>
>> >>>>>> Looking at
>> >>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt ,
>> >>>>>> I found that it didn't contain TestJoinedScanners which shows
>> >>>>>> difference in scanner performance:
>> >>>>>>
>> >>>>>>     LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
>> >>>>>>         Double.toString(timeSec)
>> >>>>>>         + " seconds, got " + Long.toString(rows_count/2) + " rows");
>> >>>>>>
>> >>>>>> The test uses SingleColumnValueFilter:
>> >>>>>>
>> >>>>>>     SingleColumnValueFilter filter = new SingleColumnValueFilter(
>> >>>>>>         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);
>> >>>>>>
>> >>>>>> It is possible that the custom filter you were using would exhibit
>> >>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
>> >>>>>> your filter utilize hint ?
>> >>>>>> It would be easier for me and other people to reproduce the issue you
>> >>>>>> experienced if you put your scenario in some test similar to
>> >>>>>> TestJoinedScanners.
>> >>>>>>
>> >>>>>> Will take a closer look at the code Monday.
>> >>>>>>
>> >>>>>> Cheers
>> >>>>>>
>> >>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:
>> >>>>>>
>> >>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
>> >>>>>>> so filterIfMissing isn't the issue - the results of the scan are correct.
>> >>>>>>> I can see that if the essential column family has more data compared to
>> >>>>>>> the non essential column family that the results would eventually even
>> >>>>>>> out.
>> >>>>>>> I was hoping to always be able to enable the essential column family
>> >>>>>>> feature. Is there an inherent reason why performance would degrade like
>> >>>>>>> this? Does it boil down to a single sequential scan versus many seeks?
>> >>>>>>>
>> >>>>>>> Thanks,
>> >>>>>>>
>> >>>>>>> James
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
>> >>>>>>>
>> >>>>>>>> James:
>> >>>>>>>> Your test was based on 0.94.6.1, right ?
>> >>>>>>>>
>> >>>>>>>> What Filter were you using ?
>> >>>>>>>>
>> >>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
>> >>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
>> >>>>>>>>
>> >>>>>>>> BTW the use case Max Lapan tried to address has non essential column
>> >>>>>>>> family carrying considerably more data compared to essential column
>> >>>>>>>> family.
>> >>>>>>>>
>> >>>>>>>> Cheers
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:
>> >>>>>>>>
>> >>>>>>>>> Hello,
>> >>>>>>>>>
>> >>>>>>>>> We're doing some performance testing of the essential column family
>> >>>>>>>>> feature, and we're seeing some performance degradation when comparing
>> >>>>>>>>> with and without the feature enabled:
>> >>>>>>>>>
>> >>>>>>>>>                          Performance of scan relative
>> >>>>>>>>> % of rows selected       to not enabling the feature
>> >>>>>>>>> ---------------------    ------------------------------
>> >>>>>>>>> 100%                     1.0x
>> >>>>>>>>> 80%                      2.0x
>> >>>>>>>>> 60%                      2.3x
>> >>>>>>>>> 40%                      2.2x
>> >>>>>>>>> 20%                      1.5x
>> >>>>>>>>> 10%                      1.0x
>> >>>>>>>>> 5%                       0.67x
>> >>>>>>>>> 0%                       0.30x
>> >>>>>>>>>
>> >>>>>>>>> In our scenario, we have two column families. The key value from the
>> >>>>>>>>> essential column family is used in the filter, while the key value from
>> >>>>>>>>> the other, non essential column family is returned by the scan. Each row
>> >>>>>>>>> contains values for both key values, with the values being relatively
>> >>>>>>>>> narrow (less than 50 bytes). In this scenario, the only time we're
>> >>>>>>>>> seeing a performance gain is when less than 10% of the rows are selected.
>> >>>>>>>>>
>> >>>>>>>>> Is this a reasonable test? Has anyone else measured this?
>> >>>>>>>>>
>> >>>>>>>>> Thanks,
>> >>>>>>>>>
>> >>>>>>>>> James
