Tatsuya Kawano created CASSANDRA-11614:
------------------------------------------
Summary: Expired tombstones are purged before locally applied (Cassandra 1.2 EOL)
Key: CASSANDRA-11614
URL: https://issues.apache.org/jira/browse/CASSANDRA-11614
Project: Cassandra
Issue Type: Bug
Components: Compaction
Reporter: Tatsuya Kawano

- Found in Cassandra 1.2.19.
- Cannot reproduce in Cassandra 2.1.13.

We have several customers using Cassandra 1.2.19 via the Thrift API. We understand that 1.2.x is already EOL and that the Thrift API is deprecated. We found this problem in 1.2.19 and decided to fix it ourselves and maintain our own patched version of Cassandra 1.2.x until all customer deployments have migrated to recent Cassandra versions. I wanted to share this information with other Cassandra 1.2.x users. Any feedback on the patch (shown at the bottom of this message) is also welcome.

h3. Problem:

Cassandra 1.2.19 may purge expired tombstones before locally applying them. This problem happens when both of the following conditions are met:

- Columns are deleted via the Thrift API {{remove()}} or {{batch_mutate()}} with a deletion
- And a minor compaction is performed with LazilyCompactedRow (large row compaction)

We use size-tiered compaction strategy for some column families and leveled compaction strategy for others. The problem happens with both strategies.

h3. Steps to Reproduce:

(Single-node Cassandra 1.2.19)

1. In cassandra.yaml, set in_memory_compaction_limit_in_mb to 1.

2. Create a keyspace "myks" and column family "mycf" with, for example, SizeTieredCompactionStrategy, and set gc_grace to 0 (so that tombstones will expire immediately).
{code}
cassandra-cli -h 127.0.0.1 <<EOF
create keyspace myks
  and placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};
use myks;
create column family mycf
  with comparator = BytesType
  and default_validation_class = BytesType
  and read_repair_chance = 0.01
  and gc_grace = 0;
describe;
exit;
EOF
{code}

3. Put enough columns in the same row (e.g.
"myrow") to make the row size bigger than 1MB.

4. Flush the CF with {{nodetool -h 127.0.0.1 flush myks mycf}} ... (SSTable A)

5. From cassandra-cli, put a column into the same row.
{code}
cassandra-cli -h 127.0.0.1 <<EOF
use myks;
set mycf[utf8('myrow')][utf8('mycol')] = utf8('myvalue');
exit;
EOF
{code}

6. Flush the CF with {{nodetool -h 127.0.0.1 flush myks mycf}} ... (SSTable B)

7. From cassandra-cli, delete the column from the row.
{code}
cassandra-cli -h 127.0.0.1 <<EOF
use myks;
del mycf[utf8('myrow')][utf8('mycol')];
exit;
EOF
{code}

8. Flush the CF with {{nodetool -h 127.0.0.1 flush myks mycf}} ... (SSTable C)

9. Get the column. -> No column should be returned.

10. Run a user-defined compaction from JMX with SSTables A and C as the input, e.g. with jmxterm (replace NNNN with the actual generation number of each SSTable):
{code}
java -jar /path/to/jmxterm-1.0-alpha-4-uber.jar
$> open 127.0.0.1:7199
$> bean org.apache.cassandra.db:type=CompactionManager
$> run forceUserDefinedCompaction myks myks-mycf-ic-NNNN-Data.db,myks-mycf-ic-NNNN-Data.db
{code}

11. Ensure the following message is written to the system.log: "Compacting a large row ..."

12. Once the compaction has finished, get the column again. -> (*expected*) The column should not be returned. (*actual*) The column is returned.

h3. Cause:

I found that {{row.getColumnFamily().deletionInfo().maxTimestamp()}} (where {{row}} is an instance of OnDiskAtomIterator) is always set to {{Long.MIN_VALUE}} for non-system column families. This value is used by {{CompactionController#shouldPurge()}} when compacting a row with LazilyCompactedRow to determine whether the expired tombstones in the row can be purged. The MIN_VALUE causes {{#shouldPurge()}} to return true almost always, so all expired tombstones in the row are purged even when they have not been locally applied. I do not know whether this is intentional, but DeletionInfo is not updated by {{DeleteStatement#mutationForKey()}} for a single column deletion (unless it is a range tombstone).
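The effect of that stale MIN_VALUE can be sketched with a toy model of the timestamp comparison inside the purge check (hypothetical class and method names, a simplified stand-in for Cassandra's actual {{#shouldPurge()}} logic, not the real code):

```java
// Hypothetical, simplified model of the purge decision described above.
public class ShouldPurgeSketch
{
    // Simplified stand-in for one of shouldPurge()'s conditions: tombstones may
    // be purged when the row's max deletion timestamp is older than the minimum
    // timestamp of every overlapping SSTable.
    static boolean shouldPurge(long maxDelTimestamp, long minTimestampOfOverlaps)
    {
        return maxDelTimestamp < minTimestampOfOverlaps;
    }

    public static void main(String[] args)
    {
        // Assumed minimum timestamp of the overlapping SSTable (SSTable B).
        long minTimestampOfOverlaps = 1400000000000L;

        // A correctly tracked deletion timestamp newer than the overlap keeps
        // the tombstone (shouldPurge returns false):
        if (shouldPurge(1400000000001L, minTimestampOfOverlaps))
            throw new AssertionError("a tombstone newer than the overlap must not be purged");

        // But deletionInfo().maxTimestamp() stays at Long.MIN_VALUE, which is
        // smaller than any real timestamp, so the check always allows purging:
        if (!shouldPurge(Long.MIN_VALUE, minTimestampOfOverlaps))
            throw new AssertionError("Long.MIN_VALUE should always pass the timestamp check");

        System.out.println("Long.MIN_VALUE always satisfies the timestamp condition -> tombstones purged");
    }
}
```

Because {{Long.MIN_VALUE}} is smaller than any real write timestamp, the overlapping-SSTable comparison can never protect the tombstone, which matches the observed behavior in step 12.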
h3. Workaround:

Increase gc_grace to something large enough that tombstones will not be purged.

h3. Solution:

Change LazilyCompactedRow to use {{Long.MAX_VALUE}} as maxDelTimestamp when calling {{#shouldPurge()}}. {{#shouldPurge()}} considers that the tombstones in a row can be purged when either of the following conditions is met:

1) {{maxDelTimestamp}} is smaller than the {{minTimestamp}} of the overlapping SSTables.
2) Or, the Bloom filters of the overlapping SSTables indicate that they do not contain the row being compacted.

Currently LazilyCompactedRow uses {{row.getColumnFamily().deletionInfo().maxTimestamp()}} as {{maxDelTimestamp}}, which causes the problem in our environment because it is always {{Long.MIN_VALUE}}. Instead, I will change LazilyCompactedRow to use {{Long.MAX_VALUE}}, disabling condition 1).

{code}
diff --git a/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java b/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java
index 433794a..0995a99 100644
--- a/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java
+++ b/src/java/org/apache/cassandra/db/compaction/LazilyCompactedRow.java
@@ -84,7 +84,10 @@ public class LazilyCompactedRow extends AbstractCompactedRow implements Iterable
             else
                 emptyColumnFamily.delete(cf);
         }
-        this.shouldPurge = controller.shouldPurge(key, maxDelTimestamp);
+
+        // Do not use maxDelTimestamp here, but Long.MAX_VALUE, because
+        // maxDelTimestamp may not be updated for some delete operations.
+        this.shouldPurge = controller.shouldPurge(key, Long.MAX_VALUE);
         try
         {
{code}

Note that I would not change the behavior of DeleteStatement#mutationForKey() to update DeletionInfo for single column deletions, because that may have side effects on other parts of Cassandra. Also, if I took that approach, I would have to rewrite all existing SSTables with the correct value, which would not be a reasonable option for large-scale deployments.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)