Re: Essential column family performance

James Taylor Mon, 08 Apr 2013 10:38:57 -0700

In TestJoinedScanners.java, is the 40% randomly distributed or sequential?

In our test, the % is randomly distributed. Also, our custom filter does the same thing that SingleColumnValueFilter does. On the client side, we'd execute the query in parallel, through multiple scans along the region boundaries. Would that have a negative impact on performance for this "essential column family" feature?


Thanks,

    James

On 04/08/2013 10:10 AM, Anoop John wrote:

Agree here. The effectiveness depends on what % of the data satisfies the
condition and how it is distributed across HFile blocks. We will get a
performance gain when we are able to skip some HFile blocks (from
non-essential CFs). Can you test with a different (lower) HFile block size?

-Anoop-
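
The point about distribution and block size can be illustrated with a small stand-alone simulation. This is only a sketch under simplifying assumptions, not HBase code: rows are assumed to be packed into fixed-size blocks of the non-essential family, and a block is assumed to be skippable only if none of its rows match. The class and method names (BlockSkipSketch, blocksTouched, and so on) are hypothetical.

```java
import java.util.Random;

class BlockSkipSketch {
    /** Counts blocks that contain at least one matching row (these cannot be skipped). */
    static int blocksTouched(boolean[] matches, int rowsPerBlock) {
        int touched = 0;
        for (int start = 0; start < matches.length; start += rowsPerBlock) {
            int end = Math.min(start + rowsPerBlock, matches.length);
            for (int i = start; i < end; i++) {
                if (matches[i]) {
                    touched++;
                    break; // one match is enough to force reading this block
                }
            }
        }
        return touched;
    }

    /** Matching rows are contiguous: the first percent% of the rows match. */
    static boolean[] sequentialMatches(int rows, int percent) {
        boolean[] m = new boolean[rows];
        int n = (int) ((long) rows * percent / 100);
        for (int i = 0; i < n; i++) {
            m[i] = true;
        }
        return m;
    }

    /** Each row matches independently with probability percent/100. */
    static boolean[] randomMatches(int rows, int percent, long seed) {
        Random rnd = new Random(seed);
        boolean[] m = new boolean[rows];
        for (int i = 0; i < rows; i++) {
            m[i] = rnd.nextInt(100) < percent;
        }
        return m;
    }

    public static void main(String[] args) {
        int rows = 100_000;
        int percent = 40;
        // Smaller rowsPerBlock stands in for a lower HFile block size.
        for (int rowsPerBlock : new int[] {1000, 100, 10, 2}) {
            int total = (rows + rowsPerBlock - 1) / rowsPerBlock;
            int seq = blocksTouched(sequentialMatches(rows, percent), rowsPerBlock);
            int rnd = blocksTouched(randomMatches(rows, percent, 42L), rowsPerBlock);
            System.out.printf("rows/block=%d total=%d sequential=%d random=%d%n",
                    rowsPerBlock, total, seq, rnd);
        }
    }
}
```

With a 40% random match and many rows per block, essentially every block contains a match, so nothing can be skipped. Blocks only start to be skipped when the matches are clustered or the blocks hold very few rows.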


On Mon, Apr 8, 2013 at 8:19 PM, Ted Yu <[email protected]> wrote:

I made the following change in TestJoinedScanners.java:

-      int flag_percent = 1;
+      int flag_percent = 40;

The test took longer but still favors joined scanner.
I got some new results:

2013-04-08 07:46:06,959 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 7.424388 seconds, got 2050 rows
...
2013-04-08 07:46:12,010 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 5.05063 seconds, got 2050 rows

2013-04-08 07:46:18,358 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 6.348517 seconds, got 2050 rows
...
2013-04-08 07:46:22,946 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 4.587545 seconds, got 2050 rows

Looks like effectiveness of joined scanner is affected by distribution of
data.

Cheers
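
For reference, the timings quoted in this thread translate into speedups as follows. This is plain arithmetic on the logged numbers (the 1% figures come from the earlier MacBook run quoted further down); the class name ScannerSpeedup is hypothetical.

```java
import java.util.Locale;

class ScannerSpeedup {
    /** Speedup of the joined scanner relative to the slow scanner. */
    static double speedup(double slowSeconds, double joinedSeconds) {
        return slowSeconds / joinedSeconds;
    }

    public static void main(String[] args) {
        // {slow seconds, joined seconds}, copied from the log lines in this thread.
        double[][] runs = {
            {7.424388, 5.05063},   // flag_percent = 40, first run
            {6.348517, 4.587545},  // flag_percent = 40, second run
            {7.973822, 0.47235},   // flag_percent = 1
        };
        String[] labels = {"40% (run 1)", "40% (run 2)", "1%"};
        for (int i = 0; i < runs.length; i++) {
            System.out.printf(Locale.ROOT, "%s: %.2fx%n",
                    labels[i], speedup(runs[i][0], runs[i][1]));
        }
    }
}
```

That works out to roughly 1.47x and 1.38x at 40%, against roughly 16.9x at 1%.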

On Sun, Apr 7, 2013 at 8:52 PM, lars hofhansl <[email protected]> wrote:

Looking at the joined scanner test code, it sets it up such that 1% of the
rows match, which would somewhat be in line with James' results.

In my own testing a while ago I found a 100% improvement with 0% match.


-- Lars



________________________________
  From: Ted Yu <[email protected]>
To: [email protected]
Sent: Sunday, April 7, 2013 4:13 PM
Subject: Re: Essential column family performance

I have attached 5416-TestJoinedScanners-0.94.txt to HBASE-5416 for your
reference.

On my MacBook, I got the following results from the test:

2013-04-07 16:08:17,474 INFO  [main] regionserver.TestJoinedScanners(157):
Slow scanner finished in 7.973822 seconds, got 100 rows
...
2013-04-07 16:08:17,946 INFO  [main] regionserver.TestJoinedScanners(157):
Joined scanner finished in 0.47235 seconds, got 100 rows

Cheers

On Sun, Apr 7, 2013 at 4:03 PM, Ted Yu <[email protected]> wrote:

Looking at
https://issues.apache.org/jira/secure/attachment/12564340/5416-0.94-v3.txt,
I found that it didn't contain TestJoinedScanners, which shows the
difference in scanner performance:

    LOG.info((slow ? "Slow" : "Joined") + " scanner finished in "
        + Double.toString(timeSec)
        + " seconds, got " + Long.toString(rows_count/2) + " rows");

The test uses SingleColumnValueFilter:

     SingleColumnValueFilter filter = new SingleColumnValueFilter(
         cf_essential, col_name, CompareFilter.CompareOp.EQUAL, flag_yes);

It is possible that the custom filter you were using exhibits a
different access pattern from SingleColumnValueFilter, e.g. does
your filter use hints?
It would be easier for me and other people to reproduce the issue you
experienced if you put your scenario in a test similar to
TestJoinedScanners.

Will take a closer look at the code Monday.

Cheers

On Sun, Apr 7, 2013 at 11:37 AM, James Taylor <[email protected]> wrote:

Yes, on 0.94.6. We have our own custom filter derived from FilterBase, so
filterIfMissing isn't the issue - the results of the scan are correct.

I can see that if the essential column family has more data than the
non essential column family, the results would eventually even out.

I was hoping to always be able to enable the essential column family
feature. Is there an inherent reason why performance would degrade like
this? Does it boil down to a single sequential scan versus many seeks?

Thanks,

James


On 04/07/2013 07:44 AM, Ted Yu wrote:

James:
Your test was based on 0.94.6.1, right ?

What Filter were you using ?

If you used SingleColumnValueFilter, have you seen my comment here?
https://issues.apache.org/jira/browse/HBASE-5416?focusedCommentId=13541229&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13541229

BTW the use case Max Lapan tried to address has the non essential column
family carrying considerably more data than the essential column family.

Cheers



On Sat, Apr 6, 2013 at 11:05 PM, James Taylor <[email protected]> wrote:

  Hello,

We're doing some performance testing of the essential column family
feature, and we're seeing some performance degradation when comparing
with and without the feature enabled:

                          Performance of scan relative
% of rows selected        to not enabling the feature
---------------------     ----------------------------
100%                      1.0x
 80%                      2.0x
 60%                      2.3x
 40%                      2.2x
 20%                      1.5x
 10%                      1.0x
  5%                      0.67x
  0%                      0.30x

In our scenario, we have two column families. The key value from the
essential column family is used in the filter, while the key value from
the other, non essential column family is returned by the scan. Each row
contains values for both key values, with the values being relatively
narrow (less than 50 bytes). In this scenario, the only time we're seeing
a performance gain is when less than 10% of the rows are selected.

Is this a reasonable test? Has anyone else measured this?

Thanks,

James
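
The scenario in the original message (filter on a value in the essential family, return a value from the non-essential family) can be modelled with a toy in-memory scan. This is a hedged sketch and not the HBASE-5416 implementation: the two column families are assumed to be plain sorted maps keyed by row, and all names (JoinedScanSketch, ScanResult, scan) are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class JoinedScanSketch {
    /** Returned values plus the number of reads against the non-essential family. */
    static final class ScanResult {
        final List<String> values = new ArrayList<>();
        int nonEssentialReads = 0;
    }

    /**
     * joined == false: both families are read for every row, then the filter is applied.
     * joined == true:  the filter runs on the essential family first and the
     *                  non-essential family is read only for rows that pass.
     */
    static ScanResult scan(TreeMap<String, String> essential,
                           TreeMap<String, String> nonEssential,
                           String flagYes, boolean joined) {
        ScanResult result = new ScanResult();
        for (Map.Entry<String, String> row : essential.entrySet()) {
            String payload = null;
            if (!joined) {
                payload = nonEssential.get(row.getKey());
                result.nonEssentialReads++;
            }
            if (!flagYes.equals(row.getValue())) {
                continue; // row filtered out
            }
            if (joined) {
                payload = nonEssential.get(row.getKey());
                result.nonEssentialReads++;
            }
            result.values.add(payload);
        }
        return result;
    }

    public static void main(String[] args) {
        TreeMap<String, String> essential = new TreeMap<>();
        TreeMap<String, String> nonEssential = new TreeMap<>();
        for (int i = 0; i < 1000; i++) {
            String key = String.format("row%04d", i);
            essential.put(key, i % 10 == 0 ? "Y" : "N"); // 10% of rows match
            nonEssential.put(key, "payload-" + i);
        }
        ScanResult slow = scan(essential, nonEssential, "Y", false);
        ScanResult joined = scan(essential, nonEssential, "Y", true);
        System.out.println("slow:   rows=" + slow.values.size()
                + " nonEssentialReads=" + slow.nonEssentialReads);
        System.out.println("joined: rows=" + joined.values.size()
                + " nonEssentialReads=" + joined.nonEssentialReads);
    }
}
```

The toy only counts reads. It does not model the relative cost of a seek versus a sequential read, which is the question raised earlier in the thread, so it says nothing about where the break-even percentage lies.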
