Benjamin Roth created CASSANDRA-12991:
-----------------------------------------
Summary: Inter-node race condition in validation compaction
Key: CASSANDRA-12991
URL: https://issues.apache.org/jira/browse/CASSANDRA-12991
Project: Cassandra
Issue Type: Improvement
Reporter: Benjamin Roth
Priority: Minor

Problem:
When a validation compaction is triggered by a repair, in-flight mutations can cause the Merkle trees to differ even though the data is consistent.

Example:
t = 10000: Repair starts validation; Node A starts validation
t = 10001: Mutation arrives at Node A
t = 10002: Mutation arrives at Node B
t = 10003: Node B starts validation

The hashes of nodes A and B will differ, but the data is consistent as of t = 10000 (think of it as a snapshot taken at that time).

Impact:
Unnecessary streaming happens. This may not have a big impact on low-traffic CFs or partitions, but on high-traffic CFs, and possibly on very large partitions, the impact can be larger and wastes resources.

Possible solution:
Build hashes based on a snapshot timestamp. This requires the contents of SSTables created after that timestamp to be filtered during a validation compaction:
- Cells with timestamp > snapshot time have to be removed
- Range tombstone markers have to be handled
- Bounds have to be removed if their delete timestamp > snapshot time
- Boundary markers have to be either changed to a bound or removed completely, depending on whether only one or both of their start and end are affected

This is probably known behaviour. Have there been any discussions about this in the past? I did not find a matching issue, so I created this one. I welcome any feedback.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
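The filtering steps listed under "Possible solution" can be sketched in Python. This is a minimal, hypothetical model, not Cassandra code and not the reporter's implementation. `Cell`, `Bound`, `Boundary`, `filter_to_snapshot` and `partition_hash` are illustrative names. Items are assumed to arrive already merged in clustering order, the mutation is assumed to carry write timestamp 10001, and a SHA-256 over `repr` stands in for the real validation digest. The sketch replays the example timeline: unfiltered, the two partition hashes differ; filtered to snapshot time 10000, they match.

```python
# Hypothetical model of snapshot-filtered validation; not Cassandra's API.
import hashlib
from dataclasses import dataclass
from typing import Iterable, List, Union


@dataclass(frozen=True)
class Cell:
    clustering: str
    value: str
    timestamp: int  # write timestamp of the cell


@dataclass(frozen=True)
class Bound:
    kind: str  # "start" opens a range deletion, "end" closes it
    clustering: str
    delete_ts: int  # deletion timestamp of the range


@dataclass(frozen=True)
class Boundary:
    # Closes one range deletion and opens another at the same clustering.
    clustering: str
    close_delete_ts: int
    open_delete_ts: int


Item = Union[Cell, Bound, Boundary]


def filter_to_snapshot(items: Iterable[Item], snapshot_ts: int) -> List[Item]:
    """Keep only what existed at snapshot_ts (items assumed merged and sorted)."""
    out: List[Item] = []
    for item in items:
        if isinstance(item, Cell):
            # Cells with timestamp > snapshot time are removed.
            if item.timestamp <= snapshot_ts:
                out.append(item)
        elif isinstance(item, Bound):
            # Bounds are removed if their delete timestamp > snapshot time.
            if item.delete_ts <= snapshot_ts:
                out.append(item)
        else:
            close_ok = item.close_delete_ts <= snapshot_ts
            open_ok = item.open_delete_ts <= snapshot_ts
            if close_ok and open_ok:
                out.append(item)  # neither side affected: keep as is
            elif close_ok:
                # The deletion opened here is too new: keep only the close.
                out.append(Bound("end", item.clustering, item.close_delete_ts))
            elif open_ok:
                # The deletion closed here is too new: keep only the open.
                out.append(Bound("start", item.clustering, item.open_delete_ts))
            # Both sides too new: the boundary is dropped entirely.
    return out


def partition_hash(items: Iterable[Item]) -> str:
    """Stand-in for the digest a Merkle tree leaf would receive."""
    h = hashlib.sha256()
    for item in items:
        h.update(repr(item).encode())
    return h.hexdigest()


# Replay of the example timeline (timestamps are the issue's t values).
SNAPSHOT_TS = 10000
base = [Cell("a", "old", 9000)]
mutation = Cell("b", "new", 10001)  # written while validation was running

node_a = base               # A's validation did not see the mutation
node_b = base + [mutation]  # B's validation started after it arrived

print(partition_hash(node_a) == partition_hash(node_b))  # prints False
print(partition_hash(filter_to_snapshot(node_a, SNAPSHOT_TS))
      == partition_hash(filter_to_snapshot(node_b, SNAPSHOT_TS)))  # prints True
```

A boundary whose two sides straddle the snapshot time is turned into whichever single bound is still older than the snapshot. This is one reading of the last bullet above, and it keeps the surviving open and close markers paired because the matching marker of the too-new deletion is dropped as a plain bound.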