[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-06 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793789#comment-17793789
 ] 

Brandon Williams commented on CASSANDRA-16418:
--

bq. Feel free to create a new ticket to add it back or piggyback in some other 
ticket, I'd be glad to review.

That would be CASSANDRA-18824

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.8, 4.1.1, 5.0-alpha1, 5.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> What we expected: Running a cleanup is a safe operation; the result of 
> running a query after a cleanup should be the same as the result of running a 
> query before a cleanup.
> What actually happened: We ran a cleanup during a decommission. All the 
> streamed data was silently deleted, the bootstrap did not fail, the cluster's 
> data after the decommission was very different to the state before.
> Why: Cleanups do not take into account pending ranges and so the cleanup 
> thought that all the data that had just been streamed was redundant and so 
> deleted it. We think that this is symmetric with bootstraps, though have not 
> verified.
> Not sure if this is technically a bug but it was very surprising (and 
> seemingly undocumented) behaviour.
>  
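The failure mode in the description above can be illustrated with a minimal sketch. This is not Cassandra's code: `TokenRange`, `CleanupSketch` and the non-wrapping integer tokens are hypothetical simplifications. It only shows why a cleanup that filters by currently owned ranges discards data streamed for a pending range, and how taking pending ranges into account avoids that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simplification: a half-open token range (left, right] that never wraps.
final class TokenRange {
    final long left;
    final long right;

    TokenRange(long left, long right) {
        this.left = left;
        this.right = right;
    }

    boolean contains(long token) {
        return token > left && token <= right;
    }
}

final class CleanupSketch {
    // Returns the tokens a cleanup would keep: those covered by at least one given range.
    static List<Long> retained(List<Long> tokensOnDisk, List<TokenRange> ranges) {
        List<Long> kept = new ArrayList<>();
        for (long token : tokensOnDisk) {
            for (TokenRange range : ranges) {
                if (range.contains(token)) {
                    kept.add(token);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<TokenRange> owned = Arrays.asList(new TokenRange(0, 100));
        // Range being streamed to this node while a neighbour decommissions (assumed values).
        List<TokenRange> pending = Arrays.asList(new TokenRange(100, 200));
        List<Long> tokensOnDisk = Arrays.asList(10L, 50L, 150L, 180L);

        // Filtering by owned ranges only drops the freshly streamed tokens.
        System.out.println(retained(tokensOnDisk, owned)); // prints [10, 50]

        // Filtering by owned plus pending ranges keeps them.
        List<TokenRange> ownedAndPending = new ArrayList<>(owned);
        ownedAndPending.addAll(pending);
        System.out.println(retained(tokensOnDisk, ownedAndPending)); // prints [10, 50, 150, 180]
    }
}
```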



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-06 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793739#comment-17793739
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

bq. However, from the API point of view, CompactionManager.performCleanup can 
now be called at any time. I think that check was an important precondition 
for that method. Wouldn't it be good to keep it there, just changing the 
condition to check pending ranges rather than joining status?

Good point, this was overlooked during review. I suggested removing that check 
just to clean up, but looking back I think there is value in keeping it for 
safety in case this API is used elsewhere. Feel free to create a new ticket to 
add it back or piggyback in some other ticket, I'd be glad to review.

Ideally the CompactionManager API would be a dumb local API, unaware of 
token ranges or membership status, since cleanup is just a local operation. In 
practice, though, these concerns are mixed across the codebase, so developers 
expect any local API to be safe from a distributed standpoint.




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-06 Thread Jacek Lewandowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17793600#comment-17793600
 ] 

Jacek Lewandowski commented on CASSANDRA-16418:
---

bq. I think that check was deemed unnecessary after a new check was added to 
StorageService.forceKeyspaceCleanup to prevent starting cleanup when there are 
pending ranges (i.e. when a node is joining).

[~paulo] - it looks OK from the user's point of view. However, from the API 
point of view, {{CompactionManager.performCleanup}} can now be called at any 
time. I think that check was an important precondition for that method. 
Wouldn't it be good to keep it there, just changing the condition to check 
pending ranges rather than joining status?
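
A rough sketch of the precondition suggested here, under stated assumptions: `PendingRangeView` and `LocalCleanup` are hypothetical names, and the real {{CompactionManager.performCleanup}} has a different signature. The point is only that the low-level method itself refuses to run while the node has pending ranges, rather than relying on every caller to check first.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical view of membership state; stands in for whatever tracks pending ranges.
interface PendingRangeView {
    int pendingRangeCount(String keyspace);
}

final class LocalCleanup {
    enum Status { SUCCESSFUL, ABORTED }

    private final PendingRangeView pendingRanges;
    private final AtomicInteger cleanupsRun = new AtomicInteger();

    LocalCleanup(PendingRangeView pendingRanges) {
        this.pendingRanges = pendingRanges;
    }

    // The precondition lives in the low-level method, so every caller is protected.
    Status performCleanup(String keyspace) {
        if (pendingRanges.pendingRangeCount(keyspace) > 0) {
            return Status.ABORTED;
        }
        cleanupsRun.incrementAndGet(); // placeholder for the actual sstable rewrite
        return Status.SUCCESSFUL;
    }

    int cleanupsRun() {
        return cleanupsRun.get();
    }

    public static void main(String[] args) {
        LocalCleanup duringRangeMovement = new LocalCleanup(keyspace -> 3);
        LocalCleanup steadyState = new LocalCleanup(keyspace -> 0);
        System.out.println(duringRangeMovement.performCleanup("ks1")); // prints ABORTED
        System.out.println(steadyState.performCleanup("ks1"));         // prints SUCCESSFUL
    }
}
```

Checking pending ranges rather than joining status also covers the decommission case from the ticket, where the node receiving data is not itself joining.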





[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-04 Thread Jacek Lewandowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792967#comment-17792967
 ] 

Jacek Lewandowski commented on CASSANDRA-16418:
---

Thanks [~samt]




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-04 Thread Sam Tunnicliffe (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792966#comment-17792966
 ] 

Sam Tunnicliffe commented on CASSANDRA-16418:
-

The TCM patch removed this check from {{forceKeyspaceCleanup}} and replaced it 
with more granular and consistent checks as cleanup is run for each CFS.  

With TCM there isn't the same concept of pending ranges. Instead, replica sets 
for reads and writes are independent and modified separately during range 
movements. If a node will acquire a range as part of a range movement, it is 
added as a write replica at the start of the operation, before any streaming 
takes place. There's a 
[walkthrough of this|https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-21%3A+Transactional+Cluster+Metadata#CEP21:TransactionalClusterMetadata-MappingClusterOperationstoMetadataTransitions(Events)]
 in the CEP doc. 
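
To make the read/write split concrete, here is a small sketch. `RangePlacement` and its methods are hypothetical and are not the TCM data structures; it assumes, for illustration, that a cleanup keeps data for any range where the node is a read or a write replica.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of one range whose read and write replica sets change independently.
final class RangePlacement {
    private final Set<String> readReplicas = new LinkedHashSet<>();
    private final Set<String> writeReplicas = new LinkedHashSet<>();

    RangePlacement(Set<String> initialReplicas) {
        readReplicas.addAll(initialReplicas);
        writeReplicas.addAll(initialReplicas);
    }

    // Start of a range movement: the gaining node receives writes before any streaming.
    void startJoin(String node) {
        writeReplicas.add(node);
    }

    // End of the movement: streaming is complete, so the node may also serve reads.
    void finishJoin(String node) {
        readReplicas.add(node);
    }

    boolean isReadReplica(String node) {
        return readReplicas.contains(node);
    }

    boolean isWriteReplica(String node) {
        return writeReplicas.contains(node);
    }

    // Assumption for this sketch: cleanup keeps data for write replicas as well as read replicas.
    boolean mustKeepData(String node) {
        return isWriteReplica(node) || isReadReplica(node);
    }

    public static void main(String[] args) {
        RangePlacement range = new RangePlacement(new LinkedHashSet<>(Arrays.asList("n1", "n2")));
        range.startJoin("n3");
        // Mid-movement: n3 is not yet readable, but its streamed data must already be kept.
        System.out.println(range.isReadReplica("n3") + " " + range.mustKeepData("n3")); // prints false true
        range.finishJoin("n3");
        System.out.println(range.isReadReplica("n3")); // prints true
    }
}
```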

Owned ranges for cleanup of a CFS are computed when the cleanup task is 
submitted, so if any range movement has started by that point the ranges 
involved would already be known to cleanup. What we don't have, and which the 
original check in {{forceKeyspaceCleanup}} also did not guard against, is a 
range movement which starts in the window between grabbing the owned ranges 
[here|https://github.com/apache/cassandra/blob/ae0842372ff6dd1437d026f82968a3749f555ff4/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L634-L642]
 and selecting the sstables for the cleanup task 
[here|https://github.com/apache/cassandra/blob/ae0842372ff6dd1437d026f82968a3749f555ff4/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L656].
 The safest way to protect against this is probably to simply cancel any 
running cleanup tasks when the set of local ranges is modified, in 
{{CFS::invalidateLocalRanges}}.
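
One way the cancellation idea could look, as a sketch only: `CleanupRegistry` and `CleanupTask` are hypothetical, and the `invalidateLocalRanges` method here merely stands in for the hook named above; it is not the real {{CFS::invalidateLocalRanges}} implementation. A running task is assumed to poll its flag between sstables and stop once cancelled.

```java
import java.util.Iterator;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

final class CleanupRegistry {
    // Hypothetical handle for a running cleanup; a real task would poll isCancelled().
    static final class CleanupTask {
        private final AtomicBoolean cancelled = new AtomicBoolean();

        boolean isCancelled() {
            return cancelled.get();
        }

        void cancel() {
            cancelled.set(true);
        }
    }

    private final Set<CleanupTask> running = ConcurrentHashMap.newKeySet();

    CleanupTask start() {
        CleanupTask task = new CleanupTask();
        running.add(task);
        return task;
    }

    void finish(CleanupTask task) {
        running.remove(task);
    }

    // Stand-in for the suggested hook: called whenever the set of local ranges changes.
    // Cancels every task that may have captured the now-stale ranges; returns how many.
    int invalidateLocalRanges() {
        int cancelledCount = 0;
        for (Iterator<CleanupTask> it = running.iterator(); it.hasNext(); ) {
            CleanupTask task = it.next();
            task.cancel();
            it.remove();
            cancelledCount++;
        }
        return cancelledCount;
    }

    public static void main(String[] args) {
        CleanupRegistry registry = new CleanupRegistry();
        CleanupTask inFlight = registry.start();
        registry.invalidateLocalRanges(); // a range movement begins mid-cleanup
        System.out.println(inFlight.isCancelled()); // prints true
    }
}
```

Cancelling rather than re-checking keeps the task logic simple: a cancelled cleanup can be resubmitted later, at which point it computes its ranges afresh.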





[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-04 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792862#comment-17792862
 ] 

Brandon Williams commented on CASSANDRA-16418:
--

Trunk does have 
[this|https://github.com/apache/cassandra/blame/trunk/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L624-L628]
 check though, which was added by TCM.  It's not clear to me either why this 
disparity exists.




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-04 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792847#comment-17792847
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

{quote}Why was that check in CompactionManager removed? Was it needed to make 
the tests run? I'm afraid that the check may have been legitimate for 
production use.
{quote}
I think that check was deemed unnecessary after a new check was added to 
[StorageService.forceKeyspaceCleanup|https://github.com/apache/cassandra/blob/cassandra-4.1/src/java/org/apache/cassandra/service/StorageService.java#L3907]
 to prevent starting cleanup when there are pending ranges (i.e. when a node is 
joining).

It's not clear to me why this latter check is not present in 
[trunk|https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L2524]
 (while it's present in 4.0/4.1).




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-12-04 Thread Jacek Lewandowski (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17792799#comment-17792799
 ] 

Jacek Lewandowski commented on CASSANDRA-16418:
---

Why was that check in {{CompactionManager}} removed? Was it needed to make the 
tests run? I'm afraid that the check may have been legitimate for production 
use.




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-09-06 Thread Szymon Miezal (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762444#comment-17762444
 ] 

Szymon Miezal commented on CASSANDRA-16418:
---

[~brandon.williams] that's reasonable, thank you for the confirmation.

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.8, 4.1.1, 5.0
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-09-06 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762442#comment-17762442
 ] 

Brandon Williams commented on CASSANDRA-16418:
--

Let's use a new ticket so the lineage can be tracked more clearly.




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-09-06 Thread Szymon Miezal (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762441#comment-17762441
 ] 

Szymon Miezal commented on CASSANDRA-16418:
---

Thank you [~smiklosovic]. Now that I know the decision wasn't deliberate, I 
think it makes sense to prepare those patches. Should the backporting work go 
into a separate ticket, or should we still use this one?




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-09-06 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762438#comment-17762438
 ] 

Stefan Miklosovic commented on CASSANDRA-16418:
---

Hey [~szymon.miezal], if you prepare a patch for that (while you are at it, 
why not for 3.0 too, if the problem is there as well?) then I can commit it 
for you. 

We probably just thought that branches older than 4.0 were not worth the 
effort. I do not think there is any special reason behind that.




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-09-06 Thread Szymon Miezal (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762436#comment-17762436
 ] 

Szymon Miezal commented on CASSANDRA-16418:
---

Given it appears to be a genuine problem on 3.11, is there any reason why it 
hasn't been merged/ported to that version?




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-23 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680021#comment-17680021
 ] 

Stefan Miklosovic commented on CASSANDRA-16418:
---

+1. thanks!

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-20 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679268#comment-17679268
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

Test failures look unrelated - for instance 
[test_decommissioned_node_cant_rejoin|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2205/testReport/junit/dtest.topology_test/TestTopology/test_decommissioned_node_cant_rejoin/]
 failed on 4.0 and 
[test_simultaneous_bootstrap|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2207/testReport/junit/dtest-novnode.bootstrap_test/TestBootstrap/test_simultaneous_bootstrap/]
 on trunk, but they don't seem to be at all related to this ticket. I ran these 
tests locally on the respective branches and they're passing.

ok to merge [~smiklosovic]?




[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678437#comment-17678437
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

Resubmitted CI after [test 
fix|https://github.com/linzuro/cassandra/commit/8de9c73d28291d2df67727ffcb7292f8c21b3442]:
||branch||CI||
|[CASSANDRA-16418-4.0|https://github.com/pauloricardomg/cassandra/tree/CASSANDRA-16418-4.0]|[#2205|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2205/] (running)|
|[CASSANDRA-16418-4.1|https://github.com/pauloricardomg/cassandra/tree/CASSANDRA-16418-4.1]|[#2206|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2206/] (running)|
|[CASSANDRA-16418-trunk|https://github.com/pauloricardomg/cassandra/tree/CASSANDRA-16418-trunk]|[#2207|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2207/] (queued)|

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-18 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17678335#comment-17678335
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

There seems to be a legit failure on 
[CleanupTest|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2202/testReport/org.apache.cassandra.db/CleanupTest/testCleanup/]

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-17 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677987#comment-17677987
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

Prepared [~linzuro]'s patch for commit on 4.0/4.1/trunk and submitted CI:


||branch||CI||
|[CASSANDRA-16418-4.0|https://github.com/pauloricardomg/cassandra/tree/CASSANDRA-16418-4.0]|[#2202|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2202/] (running)|
|[CASSANDRA-16418-4.1|https://github.com/pauloricardomg/cassandra/tree/CASSANDRA-16418-4.1]|[#2203|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2202/] (running)|
|[CASSANDRA-16418-trunk|https://github.com/pauloricardomg/cassandra/tree/CASSANDRA-16418-trunk]|[#2204|https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2204/] (queued)|

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-17 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17677879#comment-17677879
 ] 

Stefan Miklosovic commented on CASSANDRA-16418:
---

Thanks [~linzuro] for taking care of that. There is one unused import which 
fails the build. I am +1 overall once this builds successfully.

[~paulo] would you mind taking the lead here and eventually merging it, please?

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-13 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17676706#comment-17676706
 ] 

Stefan Miklosovic commented on CASSANDRA-16418:
---

I've commented on this: https://github.com/apache/cassandra/pull/2061/files

I am +1 on successful build.

Comments in the PR are just nits, I leave this to the author's discretion.

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-12 Thread Stefan Miklosovic (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675941#comment-17675941
 ] 

Stefan Miklosovic commented on CASSANDRA-16418:
---

Could you please create a PR from your branch so that we can comment on it?

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-11 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675740#comment-17675740
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

I rebased and squashed Lindsey's commit [on this 
branch|https://github.com/pauloricardomg/cassandra/tree/CASSANDRA-16418] + 
updated tests [from this 
commit|https://github.com/pauloricardomg/cassandra/commit/702f77d247893a51461823268ad6a20cd6c1a021]
 and submitted CI on 
https://github.com/pauloricardomg/cassandra/tree/CASSANDRA-16418 (still queued).

I think this is ready for a second round of review. [~JoshuaMcKenzie] 
[~stefan.miklosovic] would you have cycles to take a look?

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2023-01-11 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17675737#comment-17675737
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

In order to check the tests were reliably reproducing the issue on Lindsey's 
[branch|https://github.com/apache/cassandra/pull/2061] I commented out the 
following excerpt:
{noformat}
InetAddressAndPort localAddress = FBUtilities.getBroadcastAddressAndPort();
Integer pendingRangesCount = tokenMetadata.getPendingRanges(keyspaceName, localAddress).size();

if (pendingRangesCount > 0)
{
    throw new RuntimeException("Node is involved in cluster membership changes. Not safe to run cleanup.");
}
{noformat}
And expected both 
[testCleanupFailsDuringOngoingDecommission|https://github.com/apache/cassandra/pull/2061/files#diff-68d2cd75caa0e4091c7206717116594bdcb0aab38f72f6d6afa44eac60466e13R41]
 and 
[testCleanupFailsDuringOngoingBootstrap|https://github.com/apache/cassandra/pull/2061/files#diff-68d2cd75caa0e4091c7206717116594bdcb0aab38f72f6d6afa44eac60466e13R85]
 to fail.

Although the tests failed most of the time, they sometimes passed, meaning the 
data had not been wrongly cleaned up in those runs.

The reason is that these tests require the cleanup to run after the sstables 
have been transferred by streaming but before the ring membership operation 
finishes. There is a small chance that cleanup does not run within this window, 
in which case the issue does not reproduce, especially on faster hardware.

I took a slightly different testing approach on [this 
commit|https://github.com/pauloricardomg/cassandra/blob/702f77d247893a51461823268ad6a20cd6c1a021/test/distributed/org/apache/cassandra/distributed/test/ring/CleanupFailureTest.java#L40]
 that inserts data while a node is bootstrapping or decommissioning and checks 
the data is present after a cleanup is run. This was able to reliably reproduce 
the issue when the excerpt above is commented out.

The updated test is more deterministic because it depends on neither streaming 
nor timing. It is also faster, since it does not need the large number of rows 
that the streaming approach requires to reproduce the issue.

A nice benefit of this approach is that since we only run cleanup a single time 
while the node is bootstrapping/decommissioning, we're able to [verify that the 
cleanup fails with the expected error 
message|https://github.com/pauloricardomg/cassandra/blob/702f77d247893a51461823268ad6a20cd6c1a021/test/distributed/org/apache/cassandra/distributed/test/ring/CleanupFailureTest.java#L105].
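
A minimal, self-contained sketch of the scenario this kind of test exercises. It does not use the in-JVM dtest API or any Cassandra class: {{CleanupSketch}}, {{TokenRange}} and their methods are hypothetical names made up for illustration, and only the error string is taken from the excerpt above. The idea is that a row written into a range that is still pending survives only if cleanup refuses to run.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a single node. All names here are hypothetical, not Cassandra classes. */
final class CleanupSketch
{
    /** Error text as quoted in the excerpt from the patch. */
    static final String ERROR =
        "Node is involved in cluster membership changes. Not safe to run cleanup.";

    /** Half-open token interval [start, end); a stand-in for a real token range. */
    static final class TokenRange
    {
        final long start, end;
        TokenRange(long start, long end) { this.start = start; this.end = end; }
        boolean contains(long token) { return token >= start && token < end; }
    }

    final TokenRange owned;                          // range the node already owns
    final TokenRange pending;                        // range still being acquired, or null
    final Map<Long, String> rows = new HashMap<>();  // token -> value stored locally

    CleanupSketch(TokenRange owned, TokenRange pending)
    {
        this.owned = owned;
        this.pending = pending;
    }

    void write(long token, String value) { rows.put(token, value); }

    /** Unguarded cleanup: keeps only rows in the owned range, ignoring pending ranges. */
    void cleanupIgnoringPendingRanges()
    {
        rows.keySet().removeIf(token -> !owned.contains(token));
    }

    /** Guarded cleanup: refuses to run while a range is pending. */
    void cleanupWithSafeguard()
    {
        if (pending != null)
            throw new RuntimeException(ERROR);
        cleanupIgnoringPendingRanges();
    }

    public static void main(String[] args)
    {
        CleanupSketch node = new CleanupSketch(new TokenRange(0, 100), new TokenRange(100, 200));
        node.write(150, "written while the range was pending");
        try
        {
            node.cleanupWithSafeguard();
        }
        catch (RuntimeException e)
        {
            System.out.println("cleanup rejected: " + e.getMessage());
        }
        System.out.println("row kept after guarded attempt: " + node.rows.containsKey(150L));
        node.cleanupIgnoringPendingRanges();
        System.out.println("row kept after unguarded cleanup: " + node.rows.containsKey(150L));
    }
}
```

Because the row is written directly into the pending range, the sketch needs neither streaming nor a timing window, which is the property the updated test relies on.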

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
> Fix For: 4.0.x
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2022-12-28 Thread Paulo Motta (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17652521#comment-17652521
 ] 

Paulo Motta commented on CASSANDRA-16418:
-

Nice work [~linzuro]! The approach and test look mostly good to me; I added a 
few comments to the PR.

Can you add a similar regression test for bootstrap? The test should fail when 
the bootstrap safeguard is removed. I think you can find some bootstrap dtest 
examples in {{org.apache.cassandra.distributed.test.ring.BootstrapTest}}.

I have submitted a preliminary CI run for your branch on:
* https://ci-cassandra.apache.org/view/patches/job/Cassandra-devbranch/2151/

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
>  Time Spent: 20m
>  Remaining Estimate: 0h
>



[jira] [Commented] (CASSANDRA-16418) Unsafe to run nodetool cleanup during bootstrap or decommission

2022-12-22 Thread Lindsey Zurovchak (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-16418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651377#comment-17651377
 ] 

Lindsey Zurovchak commented on CASSANDRA-16418:
---

Created [PR|https://github.com/apache/cassandra/pull/2061] for this and made 
the following changes:
 * Added a check during cleanup to ensure the node has no pending ranges before 
proceeding
 * The bug did not exist for bootstrap because of an existing safety check, but 
that check sat one level below the other safeguard checks, so I moved it to the 
same location
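
As a rough illustration of the first change, the sketch below refuses to run a cleanup action while a keyspace has pending ranges. It is not the code from the PR: {{PendingRangeGuard}}, its methods, the string representation of ranges and the exception message are all hypothetical, chosen only to keep the example self-contained.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical guard; none of these names exist in Cassandra. */
final class PendingRangeGuard
{
    // keyspace name -> ranges this node is still acquiring (illustrative bookkeeping)
    private final Map<String, List<String>> pendingByKeyspace = new HashMap<>();

    void setPending(String keyspace, List<String> ranges)
    {
        pendingByKeyspace.put(keyspace, ranges);
    }

    int pendingCount(String keyspace)
    {
        return pendingByKeyspace.getOrDefault(keyspace, Collections.emptyList()).size();
    }

    /** Runs the cleanup action only when the keyspace has no pending ranges. */
    void runCleanup(String keyspace, Runnable cleanup)
    {
        // The precondition is checked up front, before any cleanup work starts.
        if (pendingCount(keyspace) > 0)
            throw new IllegalStateException("Cleanup refused: " + keyspace + " has pending ranges");
        cleanup.run();
    }
}
```

Checking the precondition before any work is scheduled means a refused cleanup leaves the data untouched.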

> Unsafe to run nodetool cleanup during bootstrap or decommission
> ---
>
> Key: CASSANDRA-16418
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16418
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Bootstrap and Decommission
>Reporter: James Baker
>Assignee: Lindsey Zurovchak
>Priority: Normal
>  Time Spent: 10m
>  Remaining Estimate: 0h
>