Tatsuya Kawano created CASSANDRA-11614:
------------------------------------------

             Summary: Expired tombstones are purged before locally applied (Cassandra 1.2 EOL)
                 Key: CASSANDRA-11614
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11614
             Project: Cassandra
          Issue Type: Bug
          Components: Compaction
            Reporter: Tatsuya Kawano


- Found in Cassandra 1.2.19.
- Cannot reproduce in Cassandra 2.1.13.

We have several customers using Cassandra 1.2.19 via the Thrift API. We
understand that 1.2.x is already EOL and that the Thrift API is deprecated. We
found this problem in 1.2.19 and decided to fix it ourselves and to maintain
our own patched version of Cassandra 1.2.x until all customer deployments have
been migrated to recent Cassandra versions.

I wanted to share this information with other Cassandra 1.2.x users. Also, any
feedback on the patch (shown at the bottom of this message) would be welcome.


h3. Problem:

Cassandra 1.2.19 may purge expired tombstones before they have been locally
applied (that is, while they still cover live data in other SSTables). This
problem happens when both of the following conditions are met:

- Columns are deleted via the Thrift API {{remove()}} or {{batch_mutate()}} with
a deletion
- And a minor compaction is performed with LazilyCompactedRow (large row
compaction)

We use the size-tiered compaction strategy for some column families and the
leveled compaction strategy for others. This problem happens with both
strategies.

h3. Steps to Reproduce:

(Single node Cassandra 1.2.19)

 1. In cassandra.yaml, set in_memory_compaction_limit_in_mb to 1.
 2. Create a keyspace "myks" and a column family "mycf" with, for example,
    SizeTieredCompactionStrategy, and set gc_grace to 0 (so that tombstones
    will expire immediately).

{code}
cassandra-cli -h 127.0.0.1 <<EOF
  create keyspace myks
    and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
    and strategy_options = {replication_factor:1};

  use myks;

  create column family mycf
    with comparator = BytesType
    and default_validation_class = BytesType
    and read_repair_chance = 0.01
    and gc_grace = 0;

  describe;
  exit;
EOF
{code}

 3. Put enough columns in the same row (e.g. "myrow") to make the row size 
bigger than 1MB.
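
For example, a small bash sketch (assuming {{cassandra-cli}} is on the PATH; the
column count and value size here are arbitrary, chosen so the row comfortably
exceeds 1 MB):

{code}
# Build a 1 KB filler value, then insert ~1,500 columns into row "myrow".
value=$(head -c 1024 /dev/zero | tr '\0' 'x')
{
  echo "use myks;"
  for i in $(seq 1 1500); do
    echo "set mycf[utf8('myrow')][utf8('col$i')] = utf8('$value');"
  done
  echo "exit;"
} | cassandra-cli -h 127.0.0.1
{code}
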
 4. Flush the CF by {{nodetool -h 127.0.0.1 flush myks mycf}} ... (SSTable A)
 5. From cassandra-cli, put a column to the same row.

{code}
cassandra-cli -h 127.0.0.1 <<EOF
  use myks;
  set mycf[utf8('myrow')][utf8('mycol')] = utf8('myvalue');
  exit;
EOF
{code}

 6. Flush the CF by {{nodetool -h 127.0.0.1 flush myks mycf}} ... (SSTable B)
 7. From cassandra-cli, delete the column from the row.

{code}
cassandra-cli -h 127.0.0.1 <<EOF
  use myks;
  del mycf[utf8('myrow')][utf8('mycol')];
  exit;
EOF
{code}

 8. Flush the CF by {{nodetool -h 127.0.0.1 flush myks mycf}} ... (SSTable C)

 9. Get the column.
    -> No column should be returned.
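
For example (the same command can be used again at step 12):

{code}
cassandra-cli -h 127.0.0.1 <<EOF
  use myks;
  get mycf[utf8('myrow')][utf8('mycol')];
  exit;
EOF
{code}
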
10. Run a user-defined compaction from JMX with SSTables A and C as the input.

e.g. With jmxterm
(Replace AAAA and CCCC with the actual generation numbers of SSTables A and C)

{code}
java -jar /path/to/jmxterm-1.0-alpha-4-uber.jar
$> open 127.0.0.1:7199
$> bean org.apache.cassandra.db:type=CompactionManager
$> run forceUserDefinedCompaction myks myks-mycf-ic-AAAA-Data.db,myks-mycf-ic-CCCC-Data.db
{code}
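
To find the generation numbers, you can list the column family's data directory
(a sketch; this assumes the default {{/var/lib/cassandra/data}} data location):

{code}
ls /var/lib/cassandra/data/myks/mycf/*-Data.db
{code}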

11. Ensure the following message is written to the system.log:
    "Compacting a large row ..."
12. Once compaction is finished, get the column again.
    -> (*expected*) The column should not be returned.
       (*actual*)   The column is returned.

h3. Cause:

I found that {{row.getColumnFamily().deletionInfo().maxTimestamp()}} (where
{{row}} is an instance of OnDiskAtomIterator) is always set to
{{Long.MIN_VALUE}} for non-system column families. This value is used by
{{CompactionController#shouldPurge()}}, when compacting a row with
LazilyCompactedRow, to determine whether the expired tombstones in the row can
be purged. The MIN_VALUE causes {{#shouldPurge()}} to almost always return
true, so all expired tombstones in the row are purged even when they have not
been locally applied.

I do not know if this is by design, but DeletionInfo is not updated by
{{DeleteStatement#mutationForKey()}} for a single column deletion (unless it is
a range tombstone).
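
You can observe this in the data files by dumping SSTable C with
{{sstable2json}} (a sketch; the data directory path and file name below are
assumptions based on the reproduction steps above, and the exact JSON layout
may differ):

{code}
sstable2json /var/lib/cassandra/data/myks/mycf/myks-mycf-ic-CCCC-Data.db
{code}

The output should show only the column-level tombstone for "mycol", with no
row-level deletion info for "myrow", which is consistent with {{maxTimestamp()}}
remaining {{Long.MIN_VALUE}}.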


h3. Workaround:

Increase gc_grace to a value large enough that tombstones will not be purged
before they have been applied.
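
For example, via cassandra-cli (864000 seconds, i.e. 10 days, is the usual
Cassandra default for gc_grace_seconds):

{code}
cassandra-cli -h 127.0.0.1 <<EOF
  use myks;
  update column family mycf with gc_grace = 864000;
  exit;
EOF
{code}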


h3. Solution:

Change LazilyCompactedRow to pass {{Long.MAX_VALUE}} as {{maxDelTimestamp}}
when calling {{#shouldPurge()}}.

{{#shouldPurge()}} considers that the tombstones in a row can be purged when
one of the following conditions is met:

1) {{maxDelTimestamp}} is smaller than the {{minTimestamp}} of the overlapping
SSTables.
2) Or, the Bloom filters of the overlapping SSTables indicate that they do not
contain the row being compacted.

Currently LazilyCompactedRow uses
{{row.getColumnFamily().deletionInfo().maxTimestamp()}} as {{maxDelTimestamp}},
which causes the problem in our environment because it is always
{{Long.MIN_VALUE}}. Instead, I will change LazilyCompactedRow to pass
{{Long.MAX_VALUE}}, which effectively disables condition 1).

{code}
diff --git a/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java b/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java
index 433794a..0995a99 100644
--- a/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java
+++ b/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java
@@ -84,7 +84,10 @@ public class LazilyCompactedRow extends AbstractCompactedRow implements Iterable
             else
                 emptyColumnFamily.delete(cf);
         }
-        this.shouldPurge = controller.shouldPurge(key, maxDelTimestamp);
+
+        // Do not use maxDelTimestamp here, but Long.MAX_VALUE, because
+        // maxDelTimestamp may not be updated for some delete operations.
+        this.shouldPurge = controller.shouldPurge(key, Long.MAX_VALUE);

         try
         {
{code}

Note that I would not change the behavior of DeleteStatement#mutationForKey()
to update DeletionInfo for single column deletions, because that may have side
effects on other parts of Cassandra. Also, if I took this approach, I would
have to rewrite all existing SSTables with the correct value, which would not
be a reasonable option for large-scale deployments.



