Dan Kinder created CASSANDRA-8961:
-------------------------------------

             Summary: Data rewrite case causes almost non-functional compaction
                 Key: CASSANDRA-8961
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8961
             Project: Cassandra
          Issue Type: Bug
         Environment: CentOS 6.6, Cassandra 2.0.12 (also seen in Cassandra 2.1)
            Reporter: Dan Kinder
            Priority: Minor


There seems to be a bug where compaction grinds to a halt in the following
use case: from time to time we have a set of rows we need to "migrate",
changing their primary key by deleting each row and inserting a new row with
the same partition key and a different clustering key. The Python script below
demonstrates this; it takes a while to run (I didn't try to optimize it), but
once it finishes the node will be trying to compact a few hundred megabytes of
data for a very long time... on the order of days, or it will never finish.

Though not verified by this sandboxed experiment, compression settings do not
appear to matter, and the same thing seems to happen with STCS, not just LCS.
I am still testing whether other patterns cause this terrible compaction
performance, such as deleting all rows and then inserting, or vice versa; a
sketch of that first variant follows.
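
For reference, the "delete all rows, then insert" variant I am testing looks
roughly like this (it reuses the trial.tbl schema, the db session, and n from
the repro script below; I have not yet confirmed whether it triggers the same
problem):

{code}
# Variant pattern: delete every existing row first, then re-insert
# similar rows. Assumes db and n as defined in the repro script below.
for i in range(n):
    db.execute("DELETE FROM trial.tbl WHERE pk = 'thepk' AND data = %s",
               [str(i).zfill(1024)])

for i in range(n):
    db.execute("INSERT INTO trial.tbl (pk, data) VALUES ('thepk', %s)",
               ["1" + str(i).zfill(1024)])
{code}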

Even if it isn't a "bug" per se, is there a way to fix or work around this 
behavior?
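
One workaround I plan to try (untested, and I am not sure it applies here) is
letting Cassandra run tombstone compactions on individual sstables even when
they overlap with others, via the unchecked_tombstone_compaction compaction
subproperty:

{code}
# Untested workaround attempt: enable single-sstable tombstone compactions
# even when sstables overlap. This may or may not help with this case.
db.execute("""ALTER TABLE trial.tbl
              WITH compaction = { 'class': 'LeveledCompactionStrategy',
                                  'unchecked_tombstone_compaction': 'true' }""")
{code}

The repro script itself: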

{code}
from cassandra.cluster import Cluster

cluster = Cluster(['localhost'])
db = cluster.connect()

db.execute("DROP KEYSPACE IF EXISTS trial")
db.execute("""CREATE KEYSPACE trial
              WITH REPLICATION = { 'class': 'SimpleStrategy',
                                   'replication_factor': 1 }""")
# Compression is disabled, though as noted above compression settings
# do not appear to matter.
db.execute("""CREATE TABLE trial.tbl (
                pk text,
                data text,
                PRIMARY KEY(pk, data)
              ) WITH compaction = { 'class' : 'LeveledCompactionStrategy' }
                AND compression = {'sstable_compression': ''}""")

# Number of rows to insert and "move"
n = 200000

# Insert n rows with the same partition key and ~1KB of unique data in
# the clustering column
for i in range(n):
    db.execute("INSERT INTO trial.tbl (pk, data) VALUES ('thepk', %s)",
               [str(i).zfill(1024)])

# "Migrate" those n rows: delete each one and insert a very similar row
# (same partition key, different clustering key)
for i in range(n):
    val = str(i).zfill(1024)
    db.execute("DELETE FROM trial.tbl WHERE pk = 'thepk' AND data = %s", [val])
    db.execute("INSERT INTO trial.tbl (pk, data) VALUES ('thepk', %s)",
               ["1" + val])
{code}
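
After the script completes, the cost of all those row tombstones also shows up
on the read path: the old values are zero-padded, so their tombstones sort
before every re-inserted "1"-prefixed row, and even a small slice of the
partition has to seek past all ~200k of them (expect tombstone warnings in the
Cassandra log). A quick check:

{code}
# Follow-up check (not part of the repro): this query must skip the
# ~200k row tombstones before reaching the first live row.
rows = db.execute("SELECT data FROM trial.tbl WHERE pk = 'thepk' LIMIT 5")
for row in rows:
    print(row.data[-8:])  # tail of each surviving value
{code}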


