[jira] [Commented] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds
[ https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931641#comment-16931641 ]

Jon Meredith commented on CASSANDRA-15327:
------------------------------------------

Thanks for double-checking [~jbaker200].

> Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds
> ------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-15327
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-15327
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Consistency/Bootstrap and Decommission
>            Reporter: Leon Zaruvinsky
>            Priority: Normal
>             Fix For: 2.2.15, 2.1.x
>
>         Attachments: CASSANDRA-15327-2.1.txt, CASSANDRA-15327-2.2.txt
>
> Hey,
> We've come across a scenario in production (noticed on Cassandra 2.2.14) where data that is deleted from Cassandra at consistency {{ALL}} can be resurrected. I've added a reproduction in a comment.
> If a {{delete}} is issued during a range movement (i.e. bootstrap, decommission, move), and {{gc_grace_seconds}} is surpassed before the stream is finished, then the tombstones from the {{delete}} can be purged from the recipient node before the data is streamed. Once the move is complete, the data now exists on the recipient node without a tombstone.
> We noticed this because our bootstrapping time occasionally exceeds our configured gc_grace_seconds, so we lose the consistency guarantee. As an operator, it would be great to not have to worry about this edge case.
> I've attached a patch that we have tested and successfully used in production, and haven't noticed any ill effects. Happy to submit patches for more recent versions; I'm not sure how cleanly this will actually merge since there was some refactoring to this logic in 3.x.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org
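The failure mode described in the issue can be shown with a toy timeline. The sketch below is illustrative Python, not Cassandra code; the {{Node}}, {{compact}}, and {{receive_stream}} names are invented for the example. It models a bootstrapping node that receives a forwarded delete, compacts after gc_grace_seconds has elapsed, and only then receives the streamed data:

{code:python}
GC_GRACE_SECONDS = 1  # matches the reproduction's table setting

class Node:
    """Toy model of a bootstrapping node's store (illustrative only)."""
    def __init__(self):
        self.data = {}        # key -> value
        self.tombstones = {}  # key -> deletion timestamp

    def delete(self, key, now):
        # A delete forwarded during bootstrap: drop data, record tombstone.
        self.data.pop(key, None)
        self.tombstones[key] = now

    def compact(self, now):
        # Compaction purges tombstones older than gc_grace_seconds.
        self.tombstones = {k: t for k, t in self.tombstones.items()
                           if now - t < GC_GRACE_SECONDS}

    def receive_stream(self, key, value):
        # Streamed data lands without any tombstone left to shadow it.
        if key not in self.tombstones:
            self.data[key] = value

bootstrapping = Node()
bootstrapping.delete("row0", now=0)            # delete forwarded at ALL
bootstrapping.compact(now=5)                   # gc_grace (1s) exceeded
bootstrapping.receive_stream("row0", b"blob")  # stream finishes late
print("row0" in bootstrapping.data)            # True: deleted data re-appears
{code}

Had the stream completed inside gc_grace_seconds, the tombstone would still exist and the streamed copy would be shadowed.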
[ https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931412#comment-16931412 ]

James Baker commented on CASSANDRA-15327:
-----------------------------------------

And to answer my own question... In Cassandra 2.2, it appears that the repaired_at flag is indeed set to unrepaired when bootstrapping. In trunk, it's less obvious that this is the case. That said, we stream ranges, and all of the SSTables for the range are added to the column family store at once, and so barring someone reloading SSTables during the bootstrap procedure, it probably works as intended.
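A minimal sketch of why that flag matters, assuming a repaired-only purge policy like the one CASSANDRA-6434 introduced (exposed in 3.0+ as the {{only_purge_repaired_tombstones}} compaction option). The class and function names below are illustrative, not Cassandra's actual compaction code:

{code:python}
UNREPAIRED = 0  # Cassandra represents "unrepaired" as repairedAt == 0

class SSTable:
    """Toy SSTable carrying only its repaired_at marker."""
    def __init__(self, repaired_at=UNREPAIRED):
        self.repaired_at = repaired_at

def may_purge_tombstone(sstable, gc_grace_elapsed, only_purge_repaired=True):
    # Even after gc_grace has elapsed, a repaired-only policy retains
    # tombstones that live in unrepaired SSTables. Resetting repaired_at
    # on streamed SSTables is therefore what keeps their deletes safe.
    if not gc_grace_elapsed:
        return False
    if only_purge_repaired and sstable.repaired_at == UNREPAIRED:
        return False
    return True

streamed = SSTable()                          # arrived via bootstrap
repaired = SSTable(repaired_at=1567708003731)
print(may_purge_tombstone(streamed, gc_grace_elapsed=True))  # False
print(may_purge_tombstone(repaired, gc_grace_elapsed=True))  # True
{code}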
[ https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16931356#comment-16931356 ]

James Baker commented on CASSANDRA-15327:
-----------------------------------------

Hi [~jmeredithco]

From first glance, it's not obvious to me that the solution given in CASSANDRA-6434 cannot suffer from the exact same issue. Imagine I have SSTables A and B which have both been repaired, and A has a tombstone for a value contained in B. A new node is bootstrapped and it downloads A, and then for some reason a pause occurs. What stops the tombstone from being compacted away at this point? All SSTables on the new node have been repaired, I would presume (unless bootstrap marks SSTables as being unrepaired).
[ https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930962#comment-16930962 ]

Jon Meredith commented on CASSANDRA-15327:
------------------------------------------

Thanks for reporting the issue. From what you're describing, Cassandra is behaving as designed: gc_grace_seconds should be set long enough that you can complete repair on the cluster. Configuring an absolute gcgs is not ideal, as you cannot always control how long repair will take in the face of outages and other issues. CASSANDRA-6434 implements a better method in 3.0 where gcgs is automatically set based on repair times.
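The operational constraint here, that gc_grace_seconds must outlast repair (and, in this report, streaming), reduces to a timestamp comparison. A simplified sketch of the purge check; real Cassandra additionally consults overlapping SSTables before dropping a tombstone:

{code:python}
def tombstone_purgeable(local_deletion_time, gc_grace_seconds, now):
    # Simplified: a tombstone becomes a purge candidate once
    # gc_grace_seconds have elapsed since the deletion.
    return now >= local_deletion_time + gc_grace_seconds

# With the reproduction's gc_grace_seconds = 1, any compaction run more
# than a second after the delete may drop the tombstone, long before a
# multi-hour bootstrap stream completes:
print(tombstone_purgeable(local_deletion_time=0, gc_grace_seconds=1, now=5))       # True
print(tombstone_purgeable(local_deletion_time=0, gc_grace_seconds=864000, now=5))  # False (default 10-day grace)
{code}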
[ https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16930940#comment-16930940 ]

Leon Zaruvinsky commented on CASSANDRA-15327:
---------------------------------------------

h2. Reproduction (with Bootstrap)

h3. Cluster setup

Apache Cassandra 2.2.14, 3 nodes with SimpleSeedProvider.

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.180  101.68 KB  512     66.7%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  101.64 KB  512     67.6%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  101.69 KB  512     65.7%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}

h3. Seed Data

We create a table {{datadump.ddt}} using {{LeveledCompactionStrategy}}, with {{RF = 3}} and {{gc_grace_seconds = 1}}. We write at {{QUORUM}} 20,000 rows of data in the format (row, value), where row is a partition key in [0, 2) and value is a one megabyte blob.

{code:java}
cqlsh> describe datadump;

CREATE KEYSPACE datadump WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '3'} AND durable_writes = true;

CREATE TABLE datadump.ddt (
    row bigint,
    col bigint,
    value blob static,
    PRIMARY KEY (row, col)
) WITH CLUSTERING ORDER BY (col ASC)
    AND bloom_filter_fp_chance = 0.1
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
    AND compression = {}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 1
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

cqlsh> SELECT COUNT(row) from datadump.ddt;

 system.count(row)
-------------------
                 2

$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.180  19.26 GB  512     100.0%            5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.26 GB  512     100.0%            20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  18.71 GB  512     100.0%            1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}

h3. Bootstrap + Delete

We bootstrap a fourth node to the cluster, and as soon as it begins to stream data, we delete at {{ALL}} the 20,000 rows that we inserted earlier. The deletes will be forwarded to the bootstrapping node immediately. As soon as the delete queries complete, we run a flush and major compaction on the bootstrapping node to clear the tombstones.

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UJ  x.x.x.129  14.43 KB  512     ?                 8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  19.26 GB  512     100.0%            5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.26 GB  512     100.0%            20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  18.71 GB  512     100.0%            1df2908e-5697-4b1a-9110-dcf581511510  RAC1

// First, we trigger deletes on one of the original three nodes.
// Then, we run the below on the bootstrapping node
$ nodetool flush && nodetool compact
$ nodetool compactionhistory
{code}

Compaction History:

{code:java}
id                                    keyspace_name  columnfamily_name  compacted_at   bytes_in  bytes_out  rows_merged
b63e7e30-d00a-11e9-a7b9-f18b3a65a899  datadump       ddt                1567708003731  366768    0          {}
7e2253a0-d00a-11e9-a7b9-f18b3a65a899  system         local              1567707909594  10705     10500      {4:1}
{code}

h3. Final State

Once the bootstrap has completed, the cluster looks like this:

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB  512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  19.54 GB  512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  19.54 GB  512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  19.54 GB  512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}

We run a flush and major compaction on every node. The original three nodes drop everything, but the bootstrapped node holds onto nearly 75% of the data.

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  x.x.x.129  14.93 GB   512     76.1%             8418418b-2aa2-4918-a065-8fc25887194f  RAC1
UN  x.x.x.180  143.39 KB  512     74.5%             5b073424-1f21-4aac-acd5-e0e8e82f7073  RAC1
UN  x.x.x.217  128.6 KB   512     75.3%             20e9ecd3-b3a5-4171-901c-eb6e7bc7a429  RAC1
UN  x.x.x.142  149.97 KB  512     74.1%             1df2908e-5697-4b1a-9110-dcf581511510  RAC1
{code}

If we run the count query again, we see the data get read-repaired back into the other nodes.

{code:java}
$ nodetool status
Datacenter: DC11
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load  Tokens  Owns (effective)  Host ID  Rack
{code}
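The read-repair step at the end can be sketched as well. This is an illustrative model, not Cassandra's read path: with the tombstones compacted away everywhere, the bootstrapped node's copy is the only surviving version of each row, so timestamp reconciliation treats it as the winner and writes it back to the replicas that had dropped it:

{code:python}
def read_repair(replicas, key):
    # replicas: list of dicts mapping key -> (value, write_timestamp)
    versions = [r[key] for r in replicas if key in r]
    if not versions:
        return None
    winner = max(versions, key=lambda v: v[1])  # newest timestamp wins
    for r in replicas:                          # repair stale replicas
        r[key] = winner
    return winner[0]

# Three original nodes compacted away both data and tombstones; the
# bootstrapped node still holds the resurrected row:
n1, n2, n3 = {}, {}, {}
n4 = {"row0": (b"blob", 100)}
read_repair([n1, n2, n3, n4], "row0")
print("row0" in n1)  # True: the deleted row propagates back
{code}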