[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226544#comment-17226544 ]

Marcus Eriksson commented on CASSANDRA-15789:
---------------------------------------------

[~leonz] thanks for letting us know, fixed in https://github.com/apache/cassandra/commit/fa9bbd431100ceac0af8ca3ea0a3dac407246446 and merged up to 3.11 and trunk

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15789
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Coordination, Local/Memtable, Local/SSTable
>            Reporter: Aleksey Yeschenko
>            Assignee: Marcus Eriksson
>            Priority: Normal
>             Fix For: 3.0.21, 3.11.7, 4.0, 4.0-beta1
>
> In a mixed 2.X/3.X major-version cluster, a sequence of row deletes,
> collection overwrites, paging, and read repair can cause 3.X nodes to split
> individual rows into several rows with identical clustering. This happens due
> to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
>
> To reproduce, set up a 2-node mixed major-version cluster with the following
> table:
> {code}
> CREATE TABLE distributed_test_keyspace.tbl (
>     pk int,
>     ck int,
>     v map<text, text>,
>     PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using
> timestamp 1:
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes.
> 4. Using the 2.2 node as the coordinator, force read repair by querying the
> table with page size = 2:
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'k':'l'}) USING TIMESTAMP 5;
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tbl ;
>
>  pk | ck | v
> ----+----+------------
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read that queries the second page ends up generating the
> following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>     ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
>            [2:v:!-2:!, deletedAt=1, localDeletion=1588588821]
>            [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>     [2:v:63:false:1@3,])
> {code}
> which on the 3.0 side gets incorrectly deserialized as
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1
>     partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
>     columns=[[] | [v]]
>     Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, localDeletion=1588588821, [v[c]=d ts=3]
>     Row[info=[ts=-9223372036854775808] del=deletedAt=1, localDeletion=1588588821 ]: ck=2 |
>     Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and
> finish {{collectionName}} values don't match as a wrapping fragment of a
> legacy row deletion that's being interrupted by a collection deletion - see
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889].
> Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where only one of
> // the bound has a collectionName. That happens if we have a big tombstone A (spanning one
> // or multiple rows) and a collection tombstone B. In that case, RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this simpler, if
> // we detect that case we transform the 1st and 3rd tombstone so they don't end in the middle
> // of a row (which is still correct).
> {code}
> The {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such
> a tombstone in the middle of an existing row - having seen {{v[c]=d}} first,
> it mistakenly starts a new row while in the middle of an existing one (see
> [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1500-L1501]).
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17226467#comment-17226467 ]

Leon Zaruvinsky commented on CASSANDRA-15789:
---------------------------------------------

Hey all - just wanted to call out that an errant import made it into the Cassandra 3.11 patch of this PR: https://github.com/apache/cassandra/blob/4d42c189fa82b32fd93ae42a164b91e4db62992e/src/java/org/apache/cassandra/service/StorageProxyMBean.java#L24

Not a big deal, but we noticed this because we publish the MBeans/JMX API as its own custom-built package, and hit a compilation error because {{DatabaseDescriptor}} wasn't included.
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111052#comment-17111052 ]

Sam Tunnicliffe commented on CASSANDRA-15789:
---------------------------------------------

{quote} * In this patch, we're using an [executor|https://github.com/apache/cassandra/compare/trunk...krummas:15789-3.11#diff-32fe9b86f85fea958f137ab7862ec522R42] that doesn't get shut down. Should we use non-periodic tasks for them?{quote}

This is to be explicit about making snapshot task execution single-threaded, to ensure that only a single snapshot per prefix can be triggered on a replica. Non-periodic task execution should be, and most likely always is, effectively single-threaded, but that isn't explicitly guaranteed.

{quote} * we're setting snapshot_on_duplicate_row_detection via config and diagnostic_snapshot_interval_nanos via system property. I don't mind having it as-is in the current case, but we should generally try to consolidate the way we're managing configuration. {quote}

{{diagnostic_snapshot_interval_nanos}} is purely for testing, so it didn't feel necessary to make it accessible to operators. We could subclass {{DiagnosticSnapshotService}} for testing instead, but it didn't seem too hacky to use a system property here.
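Sam's single-threaded-executor point can be sketched as below. The class and method names here are hypothetical illustrations, not the real {{DiagnosticSnapshotService}} API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SnapshotThrottleSketch {
    // A single-threaded executor runs snapshot tasks strictly one at a time,
    // so two concurrent triggers cannot race on snapshot creation.
    private final ExecutorService executor = Executors.newSingleThreadExecutor();
    // Last trigger time per snapshot prefix, used to throttle repeat triggers.
    private final Map<String, Long> lastTriggered = new ConcurrentHashMap<>();
    private final long intervalNanos;

    public SnapshotThrottleSketch(long intervalNanos) {
        this.intervalNanos = intervalNanos;
    }

    /** Schedules the snapshot task unless one ran for this prefix recently;
     *  returns true if the task was actually scheduled. */
    public boolean maybeSnapshot(String prefix, Runnable snapshotTask) {
        long now = System.nanoTime();
        Long last = lastTriggered.get(prefix);
        if (last != null && now - last < intervalNanos)
            return false; // throttled: at most one snapshot per prefix per interval
        lastTriggered.put(prefix, now);
        executor.execute(snapshotTask);
        return true;
    }

    public void shutdown() throws InterruptedException {
        executor.shutdown();
        executor.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

The design point in the comment is the first field: {{Executors.newSingleThreadExecutor()}} documents a single worker thread, whereas routing the tasks through a shared non-periodic pool would only happen to be single-threaded.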
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17111027#comment-17111027 ]

Alex Petrov commented on CASSANDRA-15789:
-----------------------------------------

+1 from me as well. Two tiny nits that can be fixed on commit:
* two getters ({{getCheckForDuplicateRowsDuringReads}} and {{getCheckForDuplicateRowsDuringCompaction}}) return {{void}}
* [toIter|https://github.com/apache/cassandra/compare/trunk...krummas:15789-3.11#diff-c43c377976893dc7ae62e89072946ecbR141] can be replaced by {{Iterators#forArray()}}

I have some questions / meta-discussions:
* In this patch, we're using an [executor|https://github.com/apache/cassandra/compare/trunk...krummas:15789-3.11#diff-32fe9b86f85fea958f137ab7862ec522R42] that doesn't get shut down. Should we use non-periodic tasks for them?
* we're setting {{snapshot_on_duplicate_row_detection}} via config and {{diagnostic_snapshot_interval_nanos}} via system property. I don't mind having it as-is in the current case, but we should generally try to consolidate the way we're managing configuration.
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110270#comment-17110270 ]

Sylvain Lebresne commented on CASSANDRA-15789:
----------------------------------------------

+1 from me.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17110126#comment-17110126 ]

Marcus Eriksson commented on CASSANDRA-15789:
---------------------------------------------

Pushed a few commits to the branches above addressing the comments:
* Made {{PartitionUpdate.fromPre30Iterator}} log whenever it [merges rows|https://github.com/krummas/cassandra/commit/7f88a2d60e0fa8da1d328c84d07da3dee6b78b18]
* [Revert|https://github.com/krummas/cassandra/commit/aedbfc761fa798c8118409cfc07ece47990578d8] to debug logging for in-jvm dtests
* [Make|https://github.com/krummas/cassandra/commit/210da67d6cae3b8c01df315d28e89c455bac488e] it possible to disable read/compaction-time duplicate detection
* Various [minor|https://github.com/krummas/cassandra/commit/7a18bbea6bb5c1203014396d85fc8a3822c7969a] [fixes|https://github.com/krummas/cassandra/commit/c4373b3b75aa234b4ebd67ffa03ad5553d73e357]
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17104630#comment-17104630 ] Alex Petrov commented on CASSANDRA-15789: - +1 with some minor comments: * {{nowInSeconds}} [here|https://github.com/apache/cassandra/commit/966ae03e778742a94fbbecffe07d977c3a39f70b#diff-984dca80cb7e39e7a99f4928ae4b3ec8R55] seems to be unused * [this|https://github.com/apache/cassandra/compare/trunk...krummas:15789-3.0#diff-984dca80cb7e39e7a99f4928ae4b3ec8R52] can be just {{flush()}} * [here|https://github.com/apache/cassandra/compare/trunk...krummas:15789-3.0#diff-32fe9b86f85fea958f137ab7862ec522R98], we will be logging unconditionally, even if we have sent snapshot messages. On a somewhat related note, we can also use {{CompactionIteratorTest#duplicateRowsTest}} to verify that throttling works by just clearing {{sentMessages}} and making sure we don't issue it again if there's one more duplicate. * I'm not 100% sure which level is the best for in-jvm dtests. Should we keep {{DEBUG}} or should we switch to {{ERROR}}. * should we add some information that shows {{Row/RT/Row}} sandwich, like the one in description? It might make it easier for people to read it in future. * in {{PartitionUpdateTest}}, {{testDuplicate}} and {{testMerge}} seem to be specific to this issue, but we don't have any ticket specific information there. Should we add some motivation/information? In fact, we may consider adding some fuzz tests to test even more scenarios in the future. * There are some (possibly intended) {{printlns}} in {{assertCommandIssued}} Regarding duplicates elimination in {{PartitionUpdate}}, since they'll still be detected and snapshotted, I think this is fine. However, I can imagine a scenario where an erroneous duplicate row can result into data resurrection. 
But given that duplicate rows are by no means correct behaviour, and we already know of at least three ways they can arise (CASSANDRA-12144, CASSANDRA-14008, and this issue), merging them into one seems a reasonable thing to do, even though it doesn't always guarantee the behaviour one would otherwise expect from the database. It might be good to make it configurable and/or disabled by default.

> Rows can get duplicated in mixed major-version clusters and after full upgrade
> --
>
> Key: CASSANDRA-15789
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15789
> Project: Cassandra
> Issue Type: Bug
> Components: Consistency/Coordination, Local/Memtable, Local/SSTable
> Reporter: Aleksey Yeschenko
> Assignee: Marcus Eriksson
> Priority: Normal
>
> In a mixed 2.X/3.X major version cluster a sequence of row deletes, collection overwrites, paging, and read repair can cause 3.X nodes to split individual rows into several rows with identical clustering. This happens due to 2.X paging and RT semantics, and a 3.X {{LegacyLayout}} deficiency.
> To reproduce, set up a 2-node mixed major version cluster with the following table:
> {code}
> CREATE TABLE distributed_test_keyspace.tbl (
>     pk int,
>     ck int,
>     v map<text, text>,
>     PRIMARY KEY (pk, ck)
> );
> {code}
> 1. Using either node as the coordinator, delete the row with ck=2 using timestamp 1:
> {code}
> DELETE FROM tbl USING TIMESTAMP 1 WHERE pk = 1 AND ck = 2;
> {code}
> 2. Using either node as the coordinator, insert the following 3 rows:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 1, {'e':'f'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'g':'h'}) USING TIMESTAMP 3;
> INSERT INTO tbl (pk, ck, v) VALUES (1, 3, {'i':'j'}) USING TIMESTAMP 3;
> {code}
> 3. Flush the table on both nodes.
> 4. Using the 2.2 node as the coordinator, force read repair by querying the table with page size = 2:
> {code}
> SELECT * FROM tbl;
> {code}
> 5. Overwrite the row with ck=2 using timestamp 5:
> {code}
> INSERT INTO tbl (pk, ck, v) VALUES (1, 2, {'k':'l'}) USING TIMESTAMP 5;
> {code}
> 6. Query the 3.0 node and observe the split row:
> {code}
> cqlsh> select * from distributed_test_keyspace.tbl ;
>
>  pk | ck | v
> ----+----+------------
>   1 |  1 | {'e': 'f'}
>   1 |  2 | {'g': 'h'}
>   1 |  2 | {'k': 'l'}
>   1 |  3 | {'i': 'j'}
> {code}
> This happens because the read to query the second page ends up generating the following mutation for the 3.0 node:
> {code}
> ColumnFamily(tbl -{deletedAt=-9223372036854775808, localDeletion=2147483647,
>   ranges=[2:v:_-2:v:!, deletedAt=2, localDeletion=1588588821]
>          [2:v:!-2:!, deletedAt=1, localDeletion=1588588821]
>          [3:v:_-3:v:!, deletedAt=2, localDeletion=1588588821]}-
>   [2:v:63:false:1@3,])
> {code}
> Which on the 3.0 side gets incorrectly deserialized as:
> {code}
> Mutation(keyspace='distributed_test_keyspace', key='0001', modifications=[
>   [distributed_test_keyspace.tbl] key=1
>     partition_deletion=deletedAt=-9223372036854775808, localDeletion=2147483647
>     columns=[[] | [v]]
>     Row[info=[ts=-9223372036854775808] ]: ck=2 | del(v)=deletedAt=2, localDeletion=1588588821, [v[c]=d ts=3]
>     Row[info=[ts=-9223372036854775808] del=deletedAt=1, localDeletion=1588588821 ]: ck=2 |
>     Row[info=[ts=-9223372036854775808] ]: ck=3 | del(v)=deletedAt=2, localDeletion=1588588821
> ])
> {code}
> {{LegacyLayout}} correctly interprets a range tombstone whose start and finish {{collectionName}} values don't match as a wrapping fragment of a legacy row deletion that's being interrupted by a collection deletion - see [code|https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/LegacyLayout.java#L1874-L1889]. Quoting the comment inline:
> {code}
> // Because of the way RangeTombstoneList work, we can have a tombstone where only one of
> // the bound has a collectionName. That happens if we have a big tombstone A (spanning one
> // or multiple rows) and a collection tombstone B. In that case, RangeTombstoneList will
> // split this into 3 RTs: the first one from the beginning of A to the beginning of B,
> // then B, then a third one from the end of B to the end of A. To make this simpler, if
> // we detect that case we transform the 1st and 3rd tombstone so they don't end in the middle
> // of a row (which is still correct).
> {code}
> The {{LegacyLayout#addRowTombstone()}} method then chokes when it encounters such a tombstone in the middle of an existing row - having seen {{v[c]=d}} first, it mistakenly starts a new row while in the middle of an existing one.
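For readers following the reproduction: under last-write-wins reconciliation, the write sequence above should resolve to exactly one row per clustering key. The following is a simplified, hypothetical Python model of that expectation (it is not Cassandra's read path; the helper name and data shapes are illustrative):

```python
def reconcile(writes):
    """Return the surviving value per clustering key, highest timestamp wins.

    writes: list of (ck, timestamp, value) tuples; value=None models a row delete.
    """
    latest = {}
    for ck, ts, value in writes:
        # A later timestamp supersedes an earlier write to the same row.
        if ck not in latest or ts > latest[ck][0]:
            latest[ck] = (ts, value)
    # Deleted rows (value None) do not appear in query results.
    return {ck: v for ck, (ts, v) in latest.items() if v is not None}


# The sequence from the ticket: delete ck=2 @ts=1, insert three rows @ts=3,
# overwrite ck=2 @ts=5.
writes = [
    (2, 1, None),            # DELETE ... USING TIMESTAMP 1
    (1, 3, {'e': 'f'}),      # INSERT ... USING TIMESTAMP 3
    (2, 3, {'g': 'h'}),
    (3, 3, {'i': 'j'}),
    (2, 5, {'k': 'l'}),      # overwrite ... USING TIMESTAMP 5
]
result = reconcile(writes)
# A healthy cluster yields one row per clustering key -- never two rows for ck=2.
```

The bug is precisely that the 3.X replica ends up with two physical rows for ck=2 instead of the single reconciled row this model produces.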
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17102676#comment-17102676 ] Sylvain Lebresne commented on CASSANDRA-15789: --

I had a quick look at those commits, and agree with the fix in {{LegacyLayout}}. I have no strong objections to the 2 other parts, but wanted to make 2 remarks:
- Regarding the elimination of duplicates on iterators coming from {{LegacyLayout}}: the patch currently merges the duplicates rather silently. What if we have another bug in {{LegacyLayout}} for which row duplication is only one symptom, but that also loses data? Are we sure we won't regret not failing on what would be an unknown bug?
- Regarding the duplicate check on all reads: I "think" this could have a measurable performance impact for some workloads. That isn't a reason not to add it, but since it affects all reads and will go into "stable" versions, do we want to run a few benchmarks to quantify it? Or have a way to disable the check?
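On the cost question raised above: because rows within a partition are iterated in clustering order, a duplicate check only needs to remember the previous clustering, i.e. one comparison per row. A rough Python sketch of that idea (illustrative names and shapes, not Cassandra's API):

```python
def find_duplicate_clusterings(rows):
    """rows: iterable of (clustering, row) pairs, sorted by clustering.

    Returns the list of clustering keys that occur more than once.
    The check is O(1) extra work per row: compare against the previous key.
    """
    duplicates = []
    prev = object()  # sentinel that never equals a real clustering key
    for clustering, _row in rows:
        if clustering == prev:
            # Record each duplicated clustering once, even for longer runs.
            if not duplicates or duplicates[-1] != clustering:
                duplicates.append(clustering)
        prev = clustering
    return duplicates


# The split row from the ticket: ck=2 appears twice with identical clustering.
rows = [(1, {'e': 'f'}), (2, {'g': 'h'}), (2, {'k': 'l'}), (3, {'i': 'j'})]
dups = find_duplicate_clusterings(rows)  # -> [2]
```

A per-row equality comparison is cheap in isolation, though as the comment notes, even cheap work on the hot read path can add up for wide partitions, which is why benchmarking or a disable switch was suggested.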
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100653#comment-17100653 ] Marcus Eriksson commented on CASSANDRA-15789: -

Attaching branches which contain commits to:
* Add an in-jvm dtest to reproduce.
* Add a new {{PartitionUpdate}} method {{fromPre30Iterator}} which merges any subsequent duplicate rows.
* Make sure we don't create a new row in legacy layout when encountering a row tombstone after seeing actual data.
* Add read- and compaction-time detection of the duplication, with the ability to automatically snapshot the involved replicas so the sstables can be investigated.

The 3.0 and 3.11 branches contain the fixes, while trunk only has the duplication detection.

|| branch || unit tests || dtests || jvm dtests || jvm upgrade dtests ||
| [3.0|https://github.com/krummas/cassandra/commits/15789-3.0] | [utests|https://circleci.com/gh/krummas/cassandra/3262] | [vnodes|https://circleci.com/gh/krummas/cassandra/3264] [novnodes|https://circleci.com/gh/krummas/cassandra/3265] | [jvm dtests|https://circleci.com/gh/krummas/cassandra/3263] | [upgrade dtests|https://circleci.com/gh/krummas/cassandra/3267] |
| [3.11|https://github.com/krummas/cassandra/commits/15789-3.11] | [utests|https://circleci.com/gh/krummas/cassandra/3236] | [vnodes|https://circleci.com/gh/krummas/cassandra/3245] [novnodes|https://circleci.com/gh/krummas/cassandra/3244] | [jvm dtests|https://circleci.com/gh/krummas/cassandra/3235] | [upgrade dtests|https://circleci.com/gh/krummas/cassandra/3249] |
| [trunk|https://github.com/krummas/cassandra/commits/15789-trunk] | [utests|https://circleci.com/gh/krummas/cassandra/3238] | [vnodes|https://circleci.com/gh/krummas/cassandra/3247] [novnodes|https://circleci.com/gh/krummas/cassandra/3248] | [jvm dtests|https://circleci.com/gh/krummas/cassandra/3237] | [upgrade dtests|https://circleci.com/gh/krummas/cassandra/3252] |

The 3.11 upgrade dtest failure is expected since it uses the current 3.0 dtest jar.
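The snapshot-on-detection behaviour described above is worth throttling: once a diagnostic snapshot has been requested for a table, repeated detections should not flood replicas with further snapshot messages. A hypothetical Python sketch of such throttling, mirroring the {{sentMessages}}-clearing test idea discussed in this thread (the class and method names are illustrative, not Cassandra's):

```python
class DuplicateRowDetector:
    """Tracks which tables have already had a diagnostic snapshot requested."""

    def __init__(self):
        self.sent_messages = set()  # tables we already snapshotted

    def on_duplicate(self, table):
        """Return True if a snapshot message should be issued for this table.

        The first detection per table triggers a snapshot; subsequent
        detections for the same table are throttled.
        """
        if table in self.sent_messages:
            return False  # throttled: snapshot already requested
        self.sent_messages.add(table)
        return True


detector = DuplicateRowDetector()
first = detector.on_duplicate("ks.tbl")   # issue snapshot
second = detector.on_duplicate("ks.tbl")  # throttled
# Clearing sent_messages (as the suggested unit test does) re-arms detection,
# so a later duplicate triggers a fresh snapshot request.
detector.sent_messages.clear()
third = detector.on_duplicate("ks.tbl")
```

This matches the verification strategy proposed earlier: clear {{sentMessages}}, then confirm a second duplicate does not re-issue the command until the state is reset.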
[jira] [Commented] (CASSANDRA-15789) Rows can get duplicated in mixed major-version clusters and after full upgrade
[ https://issues.apache.org/jira/browse/CASSANDRA-15789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100070#comment-17100070 ] Aleksey Yeschenko commented on CASSANDRA-15789: ---

{{nodetool scrub}} fixes the issue by collapsing rows with the same clustering into one, via the logic added in CASSANDRA-12144 to address a similar corruption.
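The scrub-style collapse mentioned in this thread, merging rows that share a clustering key into one, can be modelled cell-wise with highest-timestamp-wins. The following is a simplified, hypothetical Python sketch of that idea (it is not the CASSANDRA-12144 implementation; names and data shapes are illustrative):

```python
def collapse_duplicate_rows(rows):
    """rows: list of (clustering, {cell_name: (timestamp, value)}), sorted by clustering.

    Adjacent rows with identical clustering keys are merged into a single row,
    keeping the highest-timestamp version of each cell.
    """
    merged = []
    for clustering, cells in rows:
        if merged and merged[-1][0] == clustering:
            # Duplicate clustering: fold this row's cells into the previous row.
            target = merged[-1][1]
            for cell, (ts, value) in cells.items():
                if cell not in target or ts > target[cell][0]:
                    target[cell] = (ts, value)
        else:
            merged.append((clustering, dict(cells)))
    return merged


# The duplicated ck=2 rows from the ticket collapse back into a single row.
rows = [
    (2, {'g': (3, 'h')}),
    (2, {'k': (5, 'l')}),
]
collapsed = collapse_duplicate_rows(rows)
```

Note that a pure cell-wise merge keeps the union of cells; it cannot recover deletion semantics (such as the map overwrite shadowing older entries) that were lost when the row was split, which is part of why silently merging duplicates was debated above.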