Re: Essential column family performance

Ted Yu Tue, 09 Apr 2013 18:22:06 -0700

bq. with only 10000 rows that would all fit in the memstore.

This aspect should be enhanced in the test.


Cheers

On Tue, Apr 9, 2013 at 6:17 PM, Lars Hofhansl <[email protected]> wrote:

> Also the unittest tests with only 10000 rows that would all fit in the
> memstore. Seek vs reseek should make little difference for the memstore.
>
> We tested with 1m and 10m rows, and flushed the memstore  and compacted
> the store.
>
> Will do some more verification later tonight.
>
> -- Lars
>
>
> Lars H <[email protected]> wrote:
>
> >Your slow scanner performance seems to vary as well. How come? Slow is
> >with the feature off.
> >
> >I don't see how reseek can be slower than seek in any scenario.
> >
> >-- Lars
> >
> >Ted Yu <[email protected]> wrote:
> >
> >>I tried using reseek() as suggested, along with my patch from HBASE-8306
> >>(30% selection rate, random distribution and FAST_DIFF encoding on both
> >>column families).
> >>I got uneven results:
> >>
> >>2013-04-09 16:59:01,324 INFO  [main] regionserver.TestJoinedScanners(167):
> >>Slow scanner finished in 7.529083 seconds, got 1546 rows
> >>
> >>2013-04-09 16:59:06,760 INFO  [main] regionserver.TestJoinedScanners(167):
> >>Joined scanner finished in 5.43579 seconds, got 1546 rows
> >>...
> >>2013-04-09 16:59:12,711 INFO  [main] regionserver.TestJoinedScanners(167):
> >>Slow scanner finished in 5.95016 seconds, got 1546 rows
> >>
> >>2013-04-09 16:59:20,240 INFO  [main] regionserver.TestJoinedScanners(167):
> >>Joined scanner finished in 7.529044 seconds, got 1546 rows
> >>
> >>FYI
> >>
> >>On Tue, Apr 9, 2013 at 4:47 PM, lars hofhansl <[email protected]> wrote:
> >>
> >>> We did some tests here.
> >>> I ran this through the profiler against a local RegionServer and found
> >>> the part that causes the slowdown is a seek called here:
> >>>              boolean mayHaveData =
> >>>               (nextJoinedKv != null &&
> >>>                   nextJoinedKv.matchingRow(currentRow, offset, length))
> >>>               ||
> >>>               (this.joinedHeap.seek(
> >>>                       KeyValue.createFirstOnRow(currentRow, offset, length))
> >>>                   && joinedHeap.peek() != null
> >>>                   && joinedHeap.peek().matchingRow(currentRow, offset, length));
> >>>
> >>> Looking at the code, this is needed because the joinedHeap can fall
> >>> behind, and hence we have to catch it up.
> >>> The key observation, though, is that the joined heap can only ever be
> >>> behind, and hence we do not need a seek, but only a reseek.
> >>>
> >>> Deploying a RegionServer with the seek replaced with reseek we see an
> >>> improvement in *all* cases.
> >>>
> >>> I'll file a jira with a fix later.
> >>>
> >>> -- Lars
> >>>
> >>>
> >>>
> >>> ________________________________
> >>>  From: James Taylor <[email protected]>
> >>> To: [email protected]
> >>> Sent: Monday, April 8, 2013 6:53 PM
> >>> Subject: Re: Essential column family performance
> >>>
> >>> Good idea, Sergey. We'll rerun with larger non essential column family
> >>> values and see if there's a crossover point. One other difference for
> >>> us is that we're using FAST_DIFF encoding. We'll try with no encoding too.
> >>> Our table has 20 million rows across four region servers.
> >>>
> >>> Regarding the parallelization we do, we run multiple scans in parallel
> >>> instead of one single scan over the table. We use the region boundaries
> >>> of the table to divide up the work evenly, adding a start/stop key for
> >>> each scan that corresponds to the region boundaries. Our client then
> >>> does a final merge/aggregation step (i.e. adding up the count it gets
> >>> back from the scan for each region).
> >>>
> >>> On 04/08/2013 01:34 PM, Sergey Shelukhin wrote:
> >>> > IntegrationTestLazyCfLoading uses randomly distributed keys with the
> >>> > following condition for filtering:
> >>> > 1 == (Long.parseLong(Bytes.toString(rowKey, 0, 4), 16) & 1); where
> >>> > rowKey is the hex string of the MD5 key.
> >>> > Then, there are 2 "lazy" CFs, each of which has a value of 4-64k.
> >>> > This test also showed significant improvement IIRC, so random
> >>> > distribution and a high percentage of values selected should not be a
> >>> > problem as such.
> >>> >
> >>> > My hunch would be that the additional cost of seeks/merging the results
> >>> > from two CFs outweighs the benefit of lazy loading on such small values
> >>> > for the "lazy" CF with lots of data selected. This feature definitely
> >>> > makes no sense if you are selecting all values, because then extra work
> >>> > is being done for no benefit (everything is read anyway).
> >>> > So the use cases would be larger "lazy" CFs and/or a low percentage of
> >>> > values selected.
> >>> >
> >>> > Can you try to increase the 2nd CF values' size and rerun the test?
> >>> >
> >>> >
> >>> > On Mon, Apr 8, 2013 at 10:38 AM, James Taylor <[email protected]>
> >>> > wrote:
> >>> >
> >>> >> In the TestJoinedScanners.java, is the 40% randomly distributed or
> >>> >> sequential?
> >>> >>
> >>> >> In our test, the % is randomly distributed. Also, our custom filter
> >>> >> does the same thing that SingleColumnValueFilter does. On the
> >>> >> client-side, we'd execute the query in parallel, through multiple scans
> >>> >> along the region boundaries. Would that have a negative impact on
> >>> >> performance for this "essential column family" feature?
> >>> >>
> >>> >> Thanks,
> >>> >>
> >>> >>      James
> >>> >>
> >>> >>
> >>> >> On 04/08/2013 10:10 AM, Anoop John wrote:
> >>> >>
> >>> >>> Agree here. The effectiveness depends on what % of data satisfies the
> >>> >>> condition and how it is distributed across HFile blocks. We will get a
> >>> >>> performance gain when we are able to skip some HFile blocks (from
> >>> >>> non essential CFs). Can you test with a different HFile block size
> >>> >>> (lower value)?
> >>> >>>
> >>> >>> -Anoop-
> >>> >>>
> >>> >>>
> >>> >>> On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:
> >>> >>>
> >>> >>>> I made the following change in TestJoinedScanners.java:
> >>> >>>> -      int flag_percent = 1;
> >>> >>>> +      int flag_percent = 40;
> >>> >>>>
> >>> >>>> The test took longer but still favors joined scanner.
> >>> >>>> I got some new results:
> >>> >>>>
> >>> >>>> 2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> >>>> Slow scanner finished in 7.424388 seconds, got 2050 rows
> >>> >>>> ...
> >>> >>>> 2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> >>>> Joined scanner finished in 5.05063 seconds, got 2050 rows
> >>> >>>>
> >>> >>>> 2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> >>>> Slow scanner finished in 6.348517 seconds, got 2050 rows
> >>> >>>> ...
> >>> >>>> 2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> >>>> Joined scanner finished in 4.587545 seconds, got 2050 rows
> >>> >>>>
> >>> >>>> Looks like effectiveness of joined scanner is affected by distribution
> >>> >>>> of data.
> >>> >>>>
> >>> >>>> Cheers
> >>> >>>>
> >>> >>>> On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:
> >>> >>>>
> >>> >>>>> Looking at the joined scanner test code, it sets it up such that 1%
> >>> >>>>> of the rows match, which would somewhat be in line with James' results.
> >>> >>>>>
> >>> >>>>> In my own testing a while ago I found a 100% improvement with 0%
> >>> >>>>> match.
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> -- Lars
> >>> >>>>>
> >>> >>>>>
> >>> >>>>>
> >>> >>>>> ________________________________
> >>> >>>>>    From: Ted Yu <[email protected]>
> >>> >>>>> To: [email protected]
> >>> >>>>> Sent: Sunday, April 7, 2013 4:13 PM
> >>> >>>>> Subject: Re: Essential column family performance
> >>> >>>>>
> >>> >>>>> I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for
> >>> >>>>> your reference.
> >>> >>>>>
> >>> >>>>> On my MacBook, I got the following results from the test:
> >>> >>>>>
> >>> >>>>> 2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> >>>>> Slow scanner finished in 7.973822 seconds, got 100 rows
> >>> >>>>> ...
> >>> >>>>> 2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
> >>> >>>>> Joined scanner finished in 0.47235 seconds, got 100 rows
> >>> >>>>>
> >>> >>>>> Cheers
> >>> >>>>>
> >>> >>>>> On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:
> >>> >>>>>
> >>> >>>>>> Looking at
> >>> >>>>>> https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
> >>> >>>>>> I found that it didn't contain TestJoinedScanners which shows
> >>> >>>>>> difference in scanner performance:
> >>> >>>>>>
> >>> >>>>>>      LOG.info((slow ? "Slow" : "Joined") + " scanner finished in " +
> >>> >>>>>>          Double.toString(timeSec)
> >>> >>>>>>          + " seconds, got " + Long.toString(rows_count/2) + " rows");
> >>> >>>>>>
> >>> >>>>>> The test uses SingleColumnValueFilter:
> >>> >>>>>>
> >>> >>>>>>       SingleColumnValueFilter filter = new SingleColumnValueFilter(
> >>> >>>>>>           cf_essential, col_name, CompareFilter.CompareOp.EQUAL,
> >>> >>>>>>           flag_yes);
> >>> >>>>>>
> >>> >>>>>> It is possible that the custom filter you were using would exhibit
> >>> >>>>>> different access pattern compared to SingleColumnValueFilter. e.g. does
> >>> >>>>>> your filter utilize hint ?
> >>> >>>>>> It would be easier for me and other people to reproduce the issue you
> >>> >>>>>> experienced if you put your scenario in some test similar to
> >>> >>>>>> TestJoinedScanners.
> >>> >>>>>>
> >>> >>>>>> Will take a closer look at the code Monday.
> >>> >>>>>>
> >>> >>>>>> Cheers
> >>> >>>>>>
> >>> >>>>>> On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]>
> >>> >>>>>> wrote:
> >>> >>>>>>
> >>> >>>>>>> Yes, on 0.94.6. We have our own custom filter derived from FilterBase,
> >>> >>>>>>> so filterIfMissing isn't the issue - the results of the scan are
> >>> >>>>>>> correct.
> >>> >>>>>>> I can see that if the essential column family has more data compared
> >>> >>>>>>> to the non essential column family that the results would eventually
> >>> >>>>>>> even out.
> >>> >>>>>>> I was hoping to always be able to enable the essential column family
> >>> >>>>>>> feature. Is there an inherent reason why performance would degrade
> >>> >>>>>>> like this? Does it boil down to a single sequential scan versus many
> >>> >>>>>>> seeks?
> >>> >>>>>>>
> >>> >>>>>>> Thanks,
> >>> >>>>>>>
> >>> >>>>>>> James
> >>> >>>>>>>
> >>> >>>>>>>
> >>> >>>>>>> On 04/07/2013 07:44 AM, Ted Yu wrote:
> >>> >>>>>>>
> >>> >>>>>>>> James:
> >>> >>>>>>>> Your test was based on 0.94.6.1, right ?
> >>> >>>>>>>>
> >>> >>>>>>>> What Filter were you using ?
> >>> >>>>>>>>
> >>> >>>>>>>> If you used SingleColumnValueFilter, have you seen my comment here ?
> >>> >>>>>>>> https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229
> >>> >>>>>>>>
> >>> >>>>>>>> BTW the use case Max Lapan tried to address has non essential column
> >>> >>>>>>>> family carrying considerably more data compared to essential column
> >>> >>>>>>>> family.
> >>> >>>>>>>>
> >>> >>>>>>>> Cheers
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>>
> >>> >>>>>>>> On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]>
> >>> >>>>>>>> wrote:
> >>> >>>>>>>>
> >>> >>>>>>>>> Hello,
> >>> >>>>>>>>>
> >>> >>>>>>>>> We're doing some performance testing of the essential column family
> >>> >>>>>>>>> feature, and we're seeing some performance degradation when comparing
> >>> >>>>>>>>> with and without the feature enabled:
> >>> >>>>>>>>>
> >>> >>>>>>>>>                          Performance of scan relative
> >>> >>>>>>>>> % of rows selected       to not enabling the feature
> >>> >>>>>>>>> ------------------       ----------------------------
> >>> >>>>>>>>> 100%                     1.0x
> >>> >>>>>>>>>  80%                     2.0x
> >>> >>>>>>>>>  60%                     2.3x
> >>> >>>>>>>>>  40%                     2.2x
> >>> >>>>>>>>>  20%                     1.5x
> >>> >>>>>>>>>  10%                     1.0x
> >>> >>>>>>>>>   5%                     0.67x
> >>> >>>>>>>>>   0%                     0.30x
> >>> >>>>>>>>>
> >>> >>>>>>>>> In our scenario, we have two column families. The key value from the
> >>> >>>>>>>>> essential column family is used in the filter, while the key value
> >>> >>>>>>>>> from the other, non essential column family is returned by the scan.
> >>> >>>>>>>>> Each row contains values for both key values, with the values being
> >>> >>>>>>>>> relatively narrow (less than 50 bytes). In this scenario, the only
> >>> >>>>>>>>> time we're seeing a performance gain is when less than 10% of the rows
> >>> >>>>>>>>> are selected.
> >>> >>>>>>>>>
> >>> >>>>>>>>> Is this a reasonable test? Has anyone else measured this?
> >>> >>>>>>>>>
> >>> >>>>>>>>> Thanks,
> >>> >>>>>>>>>
> >>> >>>>>>>>> James
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>> >>>>>>>>>
> >>>
>
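
For readers following the seek vs. reseek point Lars makes earlier in the thread (the joined heap can only ever be behind, so it only needs to move forward), here is a minimal toy sketch. It is not HBase code: `ToyCursor`, `catchUp` and the cost model (counting key comparisons, with `seek` restarting from the beginning) are assumptions made purely for illustration, and a real HBase seek uses block indexes rather than a linear scan.

```java
public class SeekVsReseekSketch {
    /** Toy forward cursor over sorted row keys; a hypothetical stand-in, not an HBase class. */
    static final class ToyCursor {
        private final int[] sortedRows;
        private int pos = 0;
        long comparisons = 0;

        ToyCursor(int[] sortedRows) { this.sortedRows = sortedRows; }

        /** Positions at the first row >= target, starting over from the beginning. */
        boolean seek(int target) {
            pos = 0;
            return advanceTo(target);
        }

        /** Positions at the first row >= target, assuming target is not behind the cursor. */
        boolean reseek(int target) {
            return advanceTo(target);
        }

        private boolean advanceTo(int target) {
            while (pos < sortedRows.length) {
                comparisons++;
                if (sortedRows[pos] >= target) return true;
                pos++;
            }
            return false;
        }

        Integer peek() { return pos < sortedRows.length ? sortedRows[pos] : null; }
    }

    /** Catches the cursor up to each wanted row (ascending) and returns the comparison count. */
    static long catchUp(int[] rows, int[] wanted, boolean useReseek) {
        ToyCursor c = new ToyCursor(rows);
        for (int w : wanted) {
            boolean found = useReseek ? c.reseek(w) : c.seek(w);
            if (!found || c.peek() != w) {
                throw new IllegalStateException("row not found: " + w);
            }
        }
        return c.comparisons;
    }

    public static void main(String[] args) {
        int[] rows = new int[1000];
        for (int i = 0; i < rows.length; i++) rows[i] = i;
        int[] wanted = {10, 200, 500, 999};
        System.out.println("seek comparisons:   " + catchUp(rows, wanted, false));
        System.out.println("reseek comparisons: " + catchUp(rows, wanted, true));
    }
}
```

Because the wanted rows are visited in ascending order, the reseek variant never revisits keys it has already passed, which is the property the thread relies on.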

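The client-side parallelization James describes in the thread (one scan per region, bounded by start/stop keys, followed by a final merge that adds up the per-region counts) can be sketched roughly as below. This is only an illustration under stated assumptions: an in-memory `TreeMap` stands in for the table, the boolean value stands in for the filter outcome, and `sampleTable`, `countRange`, `parallelCount` and the boundary keys are hypothetical names, not part of the HBase client API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class RegionParallelCountSketch {
    /** Builds a toy table: key "rowNNN" maps to true for every fourth row. */
    static NavigableMap<String, Boolean> sampleTable(int rows) {
        NavigableMap<String, Boolean> table = new TreeMap<>();
        for (int i = 0; i < rows; i++) {
            table.put(String.format("row%03d", i), i % 4 == 0);
        }
        return table;
    }

    /** Counts matching rows in [startKey, stopKey); stands in for one filtered scan. */
    static long countRange(NavigableMap<String, Boolean> table, String startKey, String stopKey) {
        long count = 0;
        for (boolean matches : table.subMap(startKey, true, stopKey, false).values()) {
            if (matches) count++;
        }
        return count;
    }

    /** Runs one range count per pair of adjacent boundaries, then sums the partial counts. */
    static long parallelCount(NavigableMap<String, Boolean> table, List<String> boundaries) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, boundaries.size() - 1));
        try {
            List<Future<Long>> parts = new ArrayList<>();
            for (int i = 0; i + 1 < boundaries.size(); i++) {
                final String start = boundaries.get(i);
                final String stop = boundaries.get(i + 1);
                Callable<Long> task = () -> countRange(table, start, stop);
                parts.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> part : parts) {
                total += part.get(); // final merge step: add up per-range counts
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        NavigableMap<String, Boolean> table = sampleTable(100);
        // Hypothetical region boundaries; "" and "~" bracket the whole toy key space.
        List<String> boundaries = Arrays.asList("", "row025", "row050", "row075", "~");
        System.out.println("parallel count: " + parallelCount(table, boundaries));
    }
}
```

The table is only read after it has been fully built, so sharing it across the worker threads is safe in this sketch.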