[jira] [Commented] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds

2019-09-17 Thread Jon Meredith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931641#comment-16931641
 ] 

Jon Meredith commented on CASSANDRA-15327:
--

Thanks for double-checking, [~jbaker200]. 

> Deleted data can re-appear if range movement streaming time exceeds 
> gc_grace_seconds
> 
>
> Key: CASSANDRA-15327
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15327
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: Leon Zaruvinsky
>Priority: Normal
> Fix For: 2.2.15, 2.1.x
>
> Attachments: CASSANDRA-15327-2.1.txt, CASSANDRA-15327-2.2.txt
>
>
> Hey,
> We've come across a scenario in production (noticed on Cassandra 2.2.14) 
> where data that is deleted from Cassandra at consistency {{ALL}} can be 
> resurrected.  I've added a reproduction in a comment.
> If a {{delete}} is issued during a range movement (i.e. bootstrap, 
> decommission, move), and {{gc_grace_seconds}} is surpassed before the stream 
> is finished, then the tombstones from the {{delete}} can be purged from the 
> recipient node before the data is streamed. Once the move is complete, the 
> data now exists on the recipient node without a tombstone.
> We noticed this because our bootstrapping time occasionally exceeds our 
> configured gc_grace_seconds, so we lose the consistency guarantee.  As an 
> operator, it would be great to not have to worry about this edge case.
> I've attached a patch that we have tested and successfully used in 
> production, and haven't noticed any ill effects.  Happy to submit patches for 
> more recent versions, I'm not sure how cleanly this will actually merge since 
> there was some refactoring to this logic in 3.x.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds

2019-09-17 Thread James Baker (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931412#comment-16931412
 ] 

James Baker commented on CASSANDRA-15327:
-

And to answer my own question...

In Cassandra 2.2, it appears that the repaired_at flag is indeed set to 
unrepaired when bootstrapping. In trunk, it's less obvious that this is the 
case. That said, we stream ranges, and all of the SSTables for the range are 
added to the column family store at once, so barring someone reloading 
SSTables during the bootstrap procedure, it probably works as intended.
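
To make this concrete, here is a hypothetical, heavily simplified model of the 
guard in question; the class and field names below are invented for 
illustration and are not Cassandra's internals. The idea is that a 
repair-aware purger only drops tombstones that predate the SSTable's last 
repair, so SSTables streamed during bootstrap (marked unrepaired) keep their 
tombstones.
{code:java}
// Hypothetical, simplified model -- not Cassandra's actual internals.
public class RepairedAtSketch {
    static final long UNREPAIRED = 0L;   // convention: repairedAt == 0 means "never repaired"

    static class SSTable {
        final long repairedAt;           // epoch millis of the last repair, UNREPAIRED if none
        SSTable(long repairedAt) { this.repairedAt = repairedAt; }
    }

    // A repair-aware purger would only drop tombstones that predate the SSTable's last repair.
    static boolean mayPurgeTombstone(SSTable sstable, long tombstoneDeletionTimeMillis) {
        return sstable.repairedAt != UNREPAIRED
                && tombstoneDeletionTimeMillis < sstable.repairedAt;
    }

    public static void main(String[] args) {
        // SSTables received during bootstrap carry repairedAt = UNREPAIRED, so their
        // tombstones survive until a repair has actually covered them.
        SSTable streamedDuringBootstrap = new SSTable(UNREPAIRED);
        System.out.println(mayPurgeTombstone(streamedDuringBootstrap, 1_567_708_003_731L)); // false
    }
}
{code}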




[jira] [Commented] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds

2019-09-17 Thread James Baker (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931356#comment-16931356
 ] 

James Baker commented on CASSANDRA-15327:
-

Hi [~jmeredithco] 

At first glance, it's not obvious to me that the solution given in 
CASSANDRA-6434 cannot suffer from the exact same issue.

Imagine I have SSTables A and B which have both been repaired, and A has a 
tombstone for a value contained in B. A new node is bootstrapped and it 
downloads A, and then for some reason a pause occurs.

What stops the tombstone from being compacted away at this point? All SSTables 
on the new node have been repaired, I would presume (unless bootstrap marks 
SSTables as being unrepaired).
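
To restate the race concretely, here is a toy, self-contained simulation using 
plain Java collections rather than Cassandra code: if the tombstone-bearing 
SSTable A is compacted on the new node after gc_grace_seconds, and B only 
arrives afterwards, the value shadowed by the tombstone is live again.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Toy model of the scenario above: SSTable B holds a value, SSTable A holds a
// newer tombstone shadowing it. If the receiving node only has A and
// major-compacts it after gc_grace_seconds, the tombstone is dropped, so the
// value in B is live again once B finally arrives.
public class ResurrectionSketch {
    static final long GC_GRACE_SECONDS = 1;

    public static void main(String[] args) {
        Map<String, Long> tombstonesInA = new HashMap<>();   // key -> deletion time (seconds)
        Map<String, String> dataInB = new HashMap<>();       // key -> value
        dataInB.put("row1", "old-value");                    // written before the delete
        tombstonesInA.put("row1", 100L);                     // delete issued at t = 100

        long now = 200L;                                     // well past gc_grace_seconds

        // The new node has streamed A only; compaction purges expired tombstones.
        tombstonesInA.entrySet().removeIf(e -> e.getValue() + GC_GRACE_SECONDS < now);

        // B arrives afterwards; with the tombstone gone, nothing shadows the value.
        boolean resurrected = dataInB.containsKey("row1") && !tombstonesInA.containsKey("row1");
        System.out.println("row1 resurrected: " + resurrected);  // prints true
    }
}
{code}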




[jira] [Commented] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds

2019-09-16 Thread Jon Meredith (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930962#comment-16930962
 ] 

Jon Meredith commented on CASSANDRA-15327:
--

Thanks for reporting the issue. From what you're describing, Cassandra is 
behaving as designed: gc_grace_seconds should be set long enough that you can 
complete repair on the cluster.

Configuring an absolute gc_grace_seconds is not ideal, as you cannot always 
control how long repair will take in the face of outages and other issues.
CASSANDRA-6434 implements a better method in 3.0, where gcgs is automatically 
set based on repair times.
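
For reference, a minimal standalone sketch (not Cassandra's actual 
implementation) of the purge rule this advice rests on; the timestamps below 
are made up for illustration:
{code:java}
// A tombstone only becomes droppable once gc_grace_seconds have elapsed since
// the deletion, so gc_grace_seconds must exceed the longest window in which
// some replica can still be missing the tombstone (a full repair cycle, or in
// this report, a bootstrap stream).
public class GcGraceSketch {
    static boolean purgeable(long localDeletionTimeSec, int gcGraceSeconds, long nowSec) {
        return localDeletionTimeSec + gcGraceSeconds < nowSec;
    }

    public static void main(String[] args) {
        long deletedAt = 1_567_708_000L;            // illustrative deletion time (seconds)
        int gcGrace = 1;                            // the table in this report uses 1 second
        long anHourIntoTheStream = deletedAt + 3_600;
        // prints true: the tombstone can already be purged while the bootstrap is still streaming
        System.out.println(purgeable(deletedAt, gcGrace, anHourIntoTheStream));
    }
}
{code}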





[jira] [Commented] (CASSANDRA-15327) Deleted data can re-appear if range movement streaming time exceeds gc_grace_seconds

2019-09-16 Thread Leon Zaruvinsky (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16930940#comment-16930940
 ] 

Leon Zaruvinsky commented on CASSANDRA-15327:
-

h2. Reproduction (with Bootstrap)
h3. Cluster setup

Apache Cassandra 2.2.14, 3 nodes with SimpleSeedProvider.

 
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN x.x.x.180 101.68 KB 512 66.7% 5b073424-1f21-4aac-acd5-e0e8e82f7073 RAC1
UN x.x.x.217 101.64 KB 512 67.6% 20e9ecd3-b3a5-4171-901c-eb6e7bc7a429 RAC1
UN x.x.x.142 101.69 KB 512 65.7% 1df2908e-5697-4b1a-9110-dcf581511510 RAC1
{code}
h3. Seed Data

We create a table {{datadump.ddt}} using {{LeveledCompactionStrategy}}, with 
{{RF = 3}} and {{gc_grace_seconds = 1}}. We write 20,000 rows at {{QUORUM}} in 
the format (row, value), where row is a partition key in [0, 20000) and value 
is a one-megabyte blob (a sketch of the write loop follows the schema below).

 
{code:java}
cqlsh> describe datadump;

CREATE KEYSPACE datadump WITH replication = {'class': 'SimpleStrategy', 
'replication_factor': '3'} AND durable_writes = true;

CREATE TABLE datadump.ddt (
 row bigint,
 col bigint,
 value blob static,
 PRIMARY KEY (row, col)
) WITH CLUSTERING ORDER BY (col ASC)
 AND bloom_filter_fp_chance = 0.1
 AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
 AND comment = ''
 AND compaction = {'class': 
'org.apache.cassandra.db.compaction.LeveledCompactionStrategy'}
 AND compression = {}
 AND dclocal_read_repair_chance = 0.1
 AND default_time_to_live = 0
 AND gc_grace_seconds = 1
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair_chance = 0.0
 AND speculative_retry = '99.0PERCENTILE';

cqlsh> SELECT COUNT(row) from datadump.ddt;

 system.count(row)
-------------------
             20000
 
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN x.x.x.180 19.26 GB 512 100.0% 5b073424-1f21-4aac-acd5-e0e8e82f7073 RAC1
UN x.x.x.217 19.26 GB 512 100.0% 20e9ecd3-b3a5-4171-901c-eb6e7bc7a429 RAC1
UN x.x.x.142 18.71 GB 512 100.0% 1df2908e-5697-4b1a-9110-dcf581511510 RAC1
{code}
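
For completeness, roughly how the seed data could have been written. This is a 
hedged sketch assuming the DataStax Java driver 3.x; the contact point, the 
single clustering value per partition and the zero-filled blob are assumptions 
rather than details from the report.
{code:java}
import java.nio.ByteBuffer;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Sketch of the seed-data write loop: 20,000 partitions, one 1 MB blob each,
// written at QUORUM. Contact point is a placeholder for one of the three nodes.
public class SeedData {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("x.x.x.180").build();
             Session session = cluster.connect()) {
            PreparedStatement insert = session.prepare(
                    "INSERT INTO datadump.ddt (row, col, value) VALUES (?, ?, ?)");
            insert.setConsistencyLevel(ConsistencyLevel.QUORUM);   // writes at QUORUM

            ByteBuffer oneMegabyte = ByteBuffer.wrap(new byte[1024 * 1024]);
            for (long row = 0; row < 20_000; row++) {
                session.execute(insert.bind(row, 0L, oneMegabyte.duplicate()));
            }
        }
    }
}
{code}
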
h3. Bootstrap + Delete

We bootstrap a fourth node into the cluster and, as soon as it begins to stream 
data, we delete at {{ALL}} the 20,000 rows that we inserted earlier (a sketch 
of the delete loop follows the compaction history below). The deletes are 
forwarded to the bootstrapping node immediately. As soon as the delete queries 
complete, we run a flush and major compaction on the bootstrapping node to 
clear the tombstones.

 
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UJ x.x.x.129 14.43 KB 512 ? 8418418b-2aa2-4918-a065-8fc25887194f RAC1
UN x.x.x.180 19.26 GB 512 100.0% 5b073424-1f21-4aac-acd5-e0e8e82f7073 RAC1
UN x.x.x.217 19.26 GB 512 100.0% 20e9ecd3-b3a5-4171-901c-eb6e7bc7a429 RAC1
UN x.x.x.142 18.71 GB 512 100.0% 1df2908e-5697-4b1a-9110-dcf581511510 RAC1
// First, we trigger deletes on one of the original three nodes.
// Then, we run the below on the bootstrapping node
$ nodetool flush && nodetool compact
$ nodetool compactionhistory
{code}
 


Compaction History:
{code:java}
id keyspace_name columnfamily_name compacted_at bytes_in bytes_out rows_merged
b63e7e30-d00a-11e9-a7b9-f18b3a65a899 datadump ddt 1567708003731 366768 0 {}
7e2253a0-d00a-11e9-a7b9-f18b3a65a899 system local 1567707909594 10705 10500 
{4:1}
{code}
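
The delete step itself is plain CQL issued at {{ALL}}; a small self-contained 
sketch under the same assumptions as the write-loop sketch above:
{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Sketch of the delete step: every partition written earlier is deleted at ALL
// while the fourth node is still streaming. Contact point is a placeholder.
public class DeleteAll {
    public static void main(String[] args) throws Exception {
        try (Cluster cluster = Cluster.builder().addContactPoint("x.x.x.180").build();
             Session session = cluster.connect()) {
            PreparedStatement delete = session.prepare(
                    "DELETE FROM datadump.ddt WHERE row = ?");
            delete.setConsistencyLevel(ConsistencyLevel.ALL);      // deletes at ALL

            for (long row = 0; row < 20_000; row++) {
                session.execute(delete.bind(row));
            }
        }
    }
}
{code}
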
h3. Final State

Once the bootstrap has completed, the cluster looks like this:
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN x.x.x.129 14.93 GB 512 76.1% 8418418b-2aa2-4918-a065-8fc25887194f RAC1
UN x.x.x.180 19.54 GB 512 74.5% 5b073424-1f21-4aac-acd5-e0e8e82f7073 RAC1
UN x.x.x.217 19.54 GB 512 75.3% 20e9ecd3-b3a5-4171-901c-eb6e7bc7a429 RAC1
UN x.x.x.142 19.54 GB 512 74.1% 1df2908e-5697-4b1a-9110-dcf581511510 RAC1{code}
We run a flush and major compaction on every node. The original three nodes 
drop everything, but the bootstrapped node holds onto nearly 75% of the data.
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN x.x.x.129 14.93 GB 512 76.1% 8418418b-2aa2-4918-a065-8fc25887194f RAC1
UN x.x.x.180 143.39 KB 512 74.5% 5b073424-1f21-4aac-acd5-e0e8e82f7073 RAC1
UN x.x.x.217 128.6 KB 512 75.3% 20e9ecd3-b3a5-4171-901c-eb6e7bc7a429 RAC1
UN x.x.x.142 149.97 KB 512 74.1% 1df2908e-5697-4b1a-9110-dcf581511510 RAC1{code}
If we run the count query again, we see the data get read-repaired back into 
the other nodes.

 
{code:java}
$ nodetool status
Datacenter: DC11

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN x.x.x.129