[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results

2019-02-22 Thread Stefan Podkowinski (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775437#comment-16775437
 ] 

Stefan Podkowinski commented on CASSANDRA-15027:


LGTM +1

> Handle IR prepare phase failures less race prone by waiting for all results
> ---
>
> Key: CASSANDRA-15027
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15027
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Local/Compaction
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Major
> Fix For: 4.x
>
>
> Handling incremental repairs as a coordinator begins by sending a 
> {{PrepareConsistentRequest}} message to all participants, which may also 
> include the coordinator itself. Participants will run anti-compactions upon 
> receiving such a message and report the result of the operation back to the 
> coordinator.
> Once we receive a failure response from any of the participants, we fail-fast 
> in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn 
> completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then 
> the repair command will terminate with an error status, as expected.
> The issue is that in case the node will both be coordinator and participant, 
> we may end up with a local session and submitted anti-compactions, which will 
> be executed without any coordination with the coordinator session (on same 
> node). This may result in situations where running repair commands right 
> after another, may cause overlapping execution of anti-compactions that will 
> cause the following (misleading) message to show up in the logs and will 
> cause the repair to fail again:
>  "Prepare phase for incremental repair session %s has failed because it 
> encountered intersecting sstables belonging to another incremental repair 
> session (%s). This is by starting an incremental repair session before a 
> previous one has completed. Check nodetool repair_admin for hung sessions and 
> fix them."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results

2019-02-22 Thread Blake Eggleston (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775361#comment-16775361
 ] 

Blake Eggleston commented on CASSANDRA-15027:
-

Nice. Your follow on changes look good to me, I have 2 nits, but those can just 
be fixed on commit.
 
* We should log the session id in compaction manager when an anti-compaction is 
cancelled (and probably when there's an error as well)
* Some error handling should be added to the commit fixing the race between 
proposeFuture and hasFailure so nodetool doesn't hang if there's an error in 
the callback

> Handle IR prepare phase failures less race prone by waiting for all results
> ---
>
> Key: CASSANDRA-15027
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15027
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Local/Compaction
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Major
> Fix For: 4.x
>
>
> Handling incremental repairs as a coordinator begins by sending a 
> {{PrepareConsistentRequest}} message to all participants, which may also 
> include the coordinator itself. Participants will run anti-compactions upon 
> receiving such a message and report the result of the operation back to the 
> coordinator.
> Once we receive a failure response from any of the participants, we fail-fast 
> in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn 
> completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then 
> the repair command will terminate with an error status, as expected.
> The issue is that in case the node will both be coordinator and participant, 
> we may end up with a local session and submitted anti-compactions, which will 
> be executed without any coordination with the coordinator session (on same 
> node). This may result in situations where running repair commands right 
> after another, may cause overlapping execution of anti-compactions that will 
> cause the following (misleading) message to show up in the logs and will 
> cause the repair to fail again:
>  "Prepare phase for incremental repair session %s has failed because it 
> encountered intersecting sstables belonging to another incremental repair 
> session (%s). This is by starting an incremental repair session before a 
> previous one has completed. Check nodetool repair_admin for hung sessions and 
> fix them."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results

2019-02-22 Thread Stefan Podkowinski (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16775178#comment-16775178
 ] 

Stefan Podkowinski commented on CASSANDRA-15027:


Your updates look like valuable improvement over the initial patch. I'm +1 in 
general as for the changes, but also fixed some additional minor issues and 
added a new tests:

* 
[CASSANDRA-15027|https://github.com/spodkowinski/cassandra/commits/CASSANDRA-15027]
* [circleci|https://circleci.com/gh/spodkowinski/cassandra/653]

Please see comments with each commit in branch above for details.

Also happy to discuss any of the changes (most likely the last commit) in 
another jira, if you feel it's out of scope for this ticket.
 

> Handle IR prepare phase failures less race prone by waiting for all results
> ---
>
> Key: CASSANDRA-15027
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15027
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Local/Compaction
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Major
> Fix For: 4.x
>
>
> Handling incremental repairs as a coordinator begins by sending a 
> {{PrepareConsistentRequest}} message to all participants, which may also 
> include the coordinator itself. Participants will run anti-compactions upon 
> receiving such a message and report the result of the operation back to the 
> coordinator.
> Once we receive a failure response from any of the participants, we fail-fast 
> in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn 
> completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then 
> the repair command will terminate with an error status, as expected.
> The issue is that in case the node will both be coordinator and participant, 
> we may end up with a local session and submitted anti-compactions, which will 
> be executed without any coordination with the coordinator session (on same 
> node). This may result in situations where running repair commands right 
> after another, may cause overlapping execution of anti-compactions that will 
> cause the following (misleading) message to show up in the logs and will 
> cause the repair to fail again:
>  "Prepare phase for incremental repair session %s has failed because it 
> encountered intersecting sstables belonging to another incremental repair 
> session (%s). This is by starting an incremental repair session before a 
> previous one has completed. Check nodetool repair_admin for hung sessions and 
> fix them."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results

2019-02-21 Thread Blake Eggleston (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774515#comment-16774515
 ] 

Blake Eggleston commented on CASSANDRA-15027:
-

Thanks [~spo...@gmail.com]. I’ve extended your code so that in addition to 
waiting for other anti-compactions to complete, the coordinator also 
pro-actively cancels ongoing anti-compactions on the other participants. This 
avoids wasting time waiting for anti-compactions on other machines. The code 
does 3 things:
 * Adds a session state check to the {{isStopRequested}} method in the 
anti-compaction iterator.
 * The coordinator now sends failure messages to all participants when it 
receives a failure message from one of them in the prepare phase. It does not 
mark these participants as having failed internally though, since that would 
cause the nodetool session to immediately complete. Instead, it waits until 
it’s received messages from all the other nodes.
 * The participants will now respond with a failed prepare message if the 
anti-compaction completes, but the session was failed in the mean time. This 
prevents a dead lock on the coordinator in the case where the participant 
received a failure message between the time the anti-compaction completes and 
the callback fires.

Let me know what you think. If everything looks ok to you, I’m +1 on committing.

[trunk|https://github.com/bdeggleston/cassandra/tree/15027-trunk]
 
[circle|https://circleci.com/gh/bdeggleston/workflows/cassandra/tree/15027-trunk]

> Handle IR prepare phase failures less race prone by waiting for all results
> ---
>
> Key: CASSANDRA-15027
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15027
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Local/Compaction
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Major
> Fix For: 4.x
>
>
> Handling incremental repairs as a coordinator begins by sending a 
> {{PrepareConsistentRequest}} message to all participants, which may also 
> include the coordinator itself. Participants will run anti-compactions upon 
> receiving such a message and report the result of the operation back to the 
> coordinator.
> Once we receive a failure response from any of the participants, we fail-fast 
> in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn 
> completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then 
> the repair command will terminate with an error status, as expected.
> The issue is that in case the node will both be coordinator and participant, 
> we may end up with a local session and submitted anti-compactions, which will 
> be executed without any coordination with the coordinator session (on same 
> node). This may result in situations where running repair commands right 
> after another, may cause overlapping execution of anti-compactions that will 
> cause the following (misleading) message to show up in the logs and will 
> cause the repair to fail again:
>  "Prepare phase for incremental repair session %s has failed because it 
> encountered intersecting sstables belonging to another incremental repair 
> session (%s). This is by starting an incremental repair session before a 
> previous one has completed. Check nodetool repair_admin for hung sessions and 
> fix them."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15027) Handle IR prepare phase failures less race prone by waiting for all results

2019-02-18 Thread Stefan Podkowinski (JIRA)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16770869#comment-16770869
 ] 

Stefan Podkowinski commented on CASSANDRA-15027:


* [ [trunk|https://github.com/spodkowinski/cassandra/tree/CASSANDRA-15027] ][ 
[circleci|https://circleci.com/workflow-run/2b027f87-cf45-48ee-8eae-45a563701bc6]
 ]

> Handle IR prepare phase failures less race prone by waiting for all results
> ---
>
> Key: CASSANDRA-15027
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15027
> Project: Cassandra
>  Issue Type: Bug
>  Components: Consistency/Repair, Local/Compaction
>Reporter: Stefan Podkowinski
>Assignee: Stefan Podkowinski
>Priority: Major
> Fix For: 4.x
>
>
> Handling incremental repairs as a coordinator begins by sending a 
> {{PrepareConsistentRequest}} message to all participants, which may also 
> include the coordinator itself. Participants will run anti-compactions upon 
> receiving such a message and report the result of the operation back to the 
> coordinator.
> Once we receive a failure response from any of the participants, we fail-fast 
> in {{CoordinatorSession.handlePrepareResponse()}}, which will in turn 
> completes the {{prepareFuture}} that {{RepairRunnable}} is blocking on. Then 
> the repair command will terminate with an error status, as expected.
> The issue is that in case the node will both be coordinator and participant, 
> we may end up with a local session and submitted anti-compactions, which will 
> be executed without any coordination with the coordinator session (on same 
> node). This may result in situations where running repair commands right 
> after another, may cause overlapping execution of anti-compactions that will 
> cause the following (misleading) message to show up in the logs and will 
> cause the repair to fail again:
>  "Prepare phase for incremental repair session %s has failed because it 
> encountered intersecting sstables belonging to another incremental repair 
> session (%s). This is by starting an incremental repair session before a 
> previous one has completed. Check nodetool repair_admin for hung sessions and 
> fix them."



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org