Daniel Cranford created CASSANDRA-13932:
-------------------------------------------

             Summary: Write order and seed order should be different
                 Key: CASSANDRA-13932
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13932
             Project: Cassandra
          Issue Type: Bug
          Components: Tools
            Reporter: Daniel Cranford
         Attachments: 0001-Initial-implementation-cassandra-3.11.patch, 
vmtouch-after.txt, vmtouch-before.txt

Read tests get an unrealistic boost in performance because they read data from 
a set of partitions that was written sequentially.

I ran into this while running a timed read test against a large data set (250 
million partition keys): {noformat}cassandra-stress read duration=30m{noformat} 
While the test was running, I noticed that one node was performing zero IO 
after an initial period.

I discovered that each node in the cluster had blocks from only a single 
SSTable loaded in the FS cache: {noformat}vmtouch -v /path/to/sstables{noformat}

For the node that was performing zero IO, the SSTable in question was small 
enough to fit into the FS cache.

I realized that when a read test is run for a duration or until rate 
convergence, the default population for the seeds is a GAUSSIAN distribution 
over the first million seeds. Because of the way compaction works, partitions 
that are written sequentially will (with high probability) live in the same 
SSTable. That means that while the first million seeds generate partition keys 
that are randomly distributed in the token space, they will most likely all 
live in the same SSTable. When this SSTable is small enough to fit into the FS 
cache, you get unbelievably good results for a read test. Consider that a 
dataset 4x the size of the FS cache will have almost half of its data in 
SSTables small enough to fit into the FS cache.

Adjusting the population of seeds used during the read test to be the entire 
250 million seeds used to load the cluster does not fix the problem: 
{noformat}cassandra-stress read duration=30m -pop dist=gaussian(1..250M){noformat}
or (same population, larger sample) 
{noformat}cassandra-stress read n=250M{noformat}

Any distribution other than the uniform distribution has one or more modes, and 
the mode(s) of such a distribution will cluster reads around a certain seed 
range which corresponds to a certain set of sequential writes which corresponds 
to (with high probability) a single SSTable.
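
To illustrate the clustering described above, here is a small Java sketch. It 
is not part of cassandra-stress: the class name, the method name, and the 
assumption that the Gaussian spans roughly three standard deviations on either 
side of its mean are all hypothetical. It draws seeds from a Gaussian over the 
seed range, splits the range into equal-width buckets that stand in for runs of 
sequential writes, and reports the share of reads landing in the busiest bucket.

```java
import java.util.Random;

// Hypothetical illustration only; not code from cassandra-stress or the patch.
final class SeedClusteringSketch {

    /**
     * Draws `samples` seeds from a Gaussian centered on the middle of
     * [1, maxSeed] and returns the fraction that falls into the single most
     * popular of `buckets` equal-width seed ranges. Each bucket stands in for
     * one run of sequential writes (assumed here to map to about one SSTable).
     */
    static double hottestBucketShare(long maxSeed, int buckets, int samples, long rngSeed) {
        Random rng = new Random(rngSeed);
        long[] hits = new long[buckets];
        double mean = (maxSeed + 1) / 2.0;
        double stdDev = (maxSeed - 1) / 6.0; // assumption: range covers about +/- 3 sigma

        int accepted = 0;
        while (accepted < samples) {
            long seed = Math.round(mean + stdDev * rng.nextGaussian());
            if (seed < 1 || seed > maxSeed) {
                continue; // reject draws outside the seed range
            }
            int bucket = (int) ((seed - 1) * buckets / maxSeed);
            hits[bucket]++;
            accepted++;
        }

        long max = 0;
        for (long h : hits) {
            max = Math.max(max, h);
        }
        return (double) max / samples;
    }
}
```

With a uniform distribution, each of 10 buckets would receive about 10% of the 
reads. Under the assumptions above, the buckets around the mode receive a 
noticeably larger share, which is the concentration on one range of sequential 
writes that the issue describes.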

My patch against cassandra-3.11 fixes this by shuffling the sequence of 
generated seeds. Each seed value will still be generated once and only once. 
The old behavior of sequential seed generation (i.e. seed(n+1) = seed(n) + 1) 
may be selected by using the no-shuffle flag, e.g. 
{noformat}cassandra-stress read duration=30m -pop no-shuffle{noformat}
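
As a rough sketch of what "shuffled, but each seed exactly once" can mean, the 
Java snippet below maps a sequential index onto the seed range with an affine 
permutation. This is a simple stand-in chosen for brevity and is not the 
algorithm used in the attached patch. The class name, constructor parameters, 
and the restriction to ranges below 2^31 are assumptions made for the example.

```java
// Hypothetical illustration only; the attached patch may shuffle differently.
final class SeedShuffleSketch {
    private final long n;
    private final long multiplier;
    private final long offset;

    /**
     * @param n          number of seeds; seeds are the values 1..n
     * @param multiplier stride between consecutive outputs; must be coprime to n
     * @param offset     rotation applied to the sequence
     */
    SeedShuffleSketch(long n, long multiplier, long offset) {
        if (n <= 0 || n > Integer.MAX_VALUE) {
            // Keeps multiplier * index within the range of a long.
            throw new IllegalArgumentException("n must be in [1, 2^31 - 1]");
        }
        this.n = n;
        this.multiplier = Math.floorMod(multiplier, n);
        this.offset = Math.floorMod(offset, n);
        if (gcd(this.multiplier, n) != 1) {
            throw new IllegalArgumentException("multiplier must be coprime to n");
        }
    }

    /** Maps index i in [0, n) to a seed in [1, n]; distinct indexes give distinct seeds. */
    long seedAt(long i) {
        if (i < 0 || i >= n) {
            throw new IllegalArgumentException("index out of range");
        }
        return (multiplier * i + offset) % n + 1;
    }

    private static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }
}
```

Because the multiplier is coprime to n, the mapping is a bijection, so every 
seed still appears once and only once while consecutive indexes no longer 
produce consecutive seeds. An affine map is far from a statistically random 
shuffle; it only shows the once-and-only-once property.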

Results: In [^vmtouch-before.txt] only pages from a single SSTable are present 
in the FS cache, while in [^vmtouch-after.txt] pages from all SSTables are 
present in the FS cache in equal proportion.





