Samuel Klock created CASSANDRA-14415:
----------------------------------------

             Summary: Performance regression in queries for distinct keys
                 Key: CASSANDRA-14415
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14415
             Project: Cassandra
          Issue Type: Improvement
            Reporter: Samuel Klock
            Assignee: Samuel Klock


Running Cassandra 3.0.16, we observed a major performance regression affecting 
{{SELECT DISTINCT keys}}-style queries against certain tables.  Based on some 
investigation (guided by helpful feedback from Benjamin on the dev list), we 
tracked the regression down to two problems.

* One is that Cassandra was reading more data from disk than was necessary to 
satisfy the query.  This was fixed under CASSANDRA-10657 in a later 3.x release.
* The other, which remains even once the fix for CASSANDRA-10657 is 
incorporated, is this code snippet in {{RebufferingInputStream}}:
{code:java}
    @Override
    public int skipBytes(int n) throws IOException
    {
        if (n < 0)
            return 0;
        int requested = n;
        int position = buffer.position(), limit = buffer.limit(), remaining;
        while ((remaining = limit - position) < n)
        {
            n -= remaining;
            buffer.position(limit);
            // reBuffer() refills the buffer from the underlying source, so the
            // skipped bytes are still read (and decompressed, for compressed
            // sstables) before being thrown away.
            reBuffer();
            position = buffer.position();
            limit = buffer.limit();
            if (position == limit)
                // The source is exhausted; report how many bytes were skipped.
                return requested - n;
        }
        buffer.position(position + n);
        return requested;
    }
{code}
The gist of it is that to skip bytes, the stream has to read those bytes into 
memory and then throw them away.  In our tests, we were spending a lot of time 
in this method, so it looked like the chief drag on performance.

We noticed that the subclass of {{RebufferingInputStream}} in use for our 
queries, {{RandomAccessReader}} (over compressed sstables), implements a 
{{seek()}} method.  Overriding {{skipBytes()}} in it to use {{seek()}} instead 
was sufficient to fix the performance regression.
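
The override amounts to translating a skip into a seek, so the skipped bytes are 
never read or decompressed.  A minimal sketch of the idea (not the exact patch; 
the position/length accessor names here are assumptions):

{code:java}
// In RandomAccessReader: skip by seeking instead of reading and discarding.
@Override
public int skipBytes(int n) throws IOException
{
    if (n <= 0)
        return 0;
    long current = getPosition();                   // assumed accessor for the current offset
    long target = Math.min(current + n, length());  // clamp so we never seek past the end
    seek(target);                                   // repositions the reader without reading
    return (int) (target - current);
}
{code}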

The performance difference is significant for tables with large values.  It's 
straightforward to evaluate with very simple key-value tables, e.g.:

{{CREATE TABLE testtable (key TEXT PRIMARY KEY, value BLOB);}}
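
Populating such a table for a test takes only a few lines of client code.  A 
hypothetical loader using the DataStax Java driver (keyspace name, contact 
point, and sizes are placeholders):

{code:java}
import java.nio.ByteBuffer;
import java.security.SecureRandom;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class LoadTestTable
{
    public static void main(String[] args)
    {
        SecureRandom rng = new SecureRandom();
        byte[] value = new byte[1 << 20];  // e.g., 1 MB per entry
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("testks"))  // hypothetical keyspace
        {
            PreparedStatement insert =
                session.prepare("INSERT INTO testtable (key, value) VALUES (?, ?)");
            for (int i = 0; i < 10000; i++)
            {
                rng.nextBytes(value);  // random bytes: uncompressible
                // For the compressible variant, leave the array zero-filled instead.
                session.execute(insert.bind("key" + i, ByteBuffer.wrap(value)));
            }
        }
    }
}
{code}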

We did some basic experimentation with the following variations (all in a 
single-node 3.11.2 cluster with off-the-shelf settings running on a dev 
workstation):

* small values (1 KB, 100,000 entries), somewhat larger values (25 KB, 10,000 
entries), and much larger values (1 MB, 10,000 entries);
* compressible data (a single byte repeated) and uncompressible data (output 
from {{openssl rand $bytes}}); and
* with and without sstable compression.  (With compression, we use Cassandra's 
defaults.)
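
Each measurement exhausts the results of a paged {{SELECT DISTINCT}} query.  A 
sketch of such a timing loop (again hypothetical, using the DataStax Java 
driver; the page size of 24 matches the runs below):

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class DistinctKeysTimer
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("testks"))  // hypothetical keyspace
        {
            SimpleStatement query = new SimpleStatement("SELECT DISTINCT key FROM testtable");
            query.setFetchSize(24);  // page size
            for (int run = 0; run < 5; run++)
            {
                long start = System.nanoTime();
                long keys = 0;
                for (Row ignored : session.execute(query))  // iteration fetches every page
                    keys++;
                System.out.printf("%d keys in %.3f s%n", keys, (System.nanoTime() - start) / 1e9);
            }
        }
    }
}
{code}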

The difference is most conspicuous for tables with large, uncompressible data 
and sstable compression enabled (which happens to describe the use case that 
triggered our investigation).  It is smaller but still readily apparent for 
tables with effective compression.  For uncompressible data without compression 
enabled, there is no appreciable difference.

Here's what the performance looks like without our patch for the 1-MB entries 
(times in seconds, five consecutive runs for each data set, all exhausting the 
results from a {{SELECT DISTINCT key FROM ...}} query with a page size of 24):

{noformat}
working on compressible
5.21180510521
5.10270500183
5.22311806679
4.6732840538
4.84219098091
working on uncompressible_uncompressed
55.0423607826
0.769015073776
0.850513935089
0.713396072388
0.62596988678
working on uncompressible
413.292617083
231.345913887
449.524993896
425.135111094
243.469946861
{noformat}

and with the fix:

{noformat}
working on compressible
2.86733293533
1.24895811081
1.108907938
1.12742400169
1.04647302628
working on uncompressible_uncompressed
56.4146180153
0.895509958267
0.922824144363
0.772884130478
0.731923818588
working on uncompressible
64.4587619305
1.81325793266
1.52577018738
1.41769099236
1.60442209244
{noformat}

The long initial runs for the uncompressible data presumably come from 
repeatedly hitting the disk.  With the fix, the initial run seems to be 
effective at warming the page cache (since most of the data is skipped, the 
data that is actually read can fit in memory), so subsequent runs are faster; 
the runs without the fix show no comparable warm-up benefit.

For smaller data sets, {{RandomAccessReader.seek()}} and 
{{RebufferingInputStream.skipBytes()}} are approximately equivalent in their 
behavior (most of the time, both reduce to changing the position pointer of an 
in-memory buffer), so there isn't much difference.  Here's before the fix for 
the 1-KB entries:

{noformat}
working on small_compressible
8.34115099907
8.57280993462
8.3534219265
8.55130696297
8.17362189293
working on small_uncompressible_uncompressed
7.85155582428
7.54075288773
7.50106596947
7.39202189445
7.95735621452
working on small_uncompressible
7.89256501198
7.88875198364
7.9013261795
7.76551413536
7.84927678108
{noformat}

and after:

{noformat}
working on small_compressible
8.29225707054
7.57822394371
8.10092878342
8.21332192421
8.19347810745
working on small_uncompressible_uncompressed
7.74823594093
7.81218004227
7.68660092354
7.95432114601
7.77612304688
working on small_uncompressible
8.18260502815
8.21010804176
8.1233921051
7.31543707848
7.91079998016
{noformat}

The effect is similar for the 25-KB entries, which might enjoy a slight 
performance benefit from the patch (perhaps because they're larger than the 
default buffer size defined in {{RandomAccessReader}}).  Before:

{noformat}
working on medium_compressible
0.988080978394
1.02464294434
0.977658033371
1.02553391457
0.769363880157
working on medium_uncompressible_uncompressed
1.07718396187
1.08547902107
1.12398791313
1.10300898552
1.08757281303
working on medium_uncompressible
0.940990209579
0.917474985123
0.768013954163
0.871683835983
0.814841985703
{noformat}

and after:

{noformat}
working on medium_compressible
0.829009056091
0.705173015594
0.603646993637
0.820069074631
0.873830080032
working on medium_uncompressible_uncompressed
0.785156965256
0.808106184006
0.848286151886
0.857885837555
0.825689077377
working on medium_uncompressible
0.845101118088
0.913790941238
0.824147939682
0.849114894867
0.85981798172
{noformat}

In short, this looks like a pretty straightforward performance win with 
negligible cost.  (It's worth noting that for our use case, disabling sstable 
compression is clearly the _best_ solution, but there's still a reasonably 
clear benefit from this minor fix for data sets with larger, compressible 
values, and presumably for data sets with a mix of compressible and 
uncompressible values in environments where storage is limited.)


