[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2022-12-06 Thread Stefan Miklosovic (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Miklosovic updated CASSANDRA-14930:
--
Status: Patch Available  (was: Review In Progress)

> decommission may cause timeout because messaging backlog is cleared 
> 
>
> Key: CASSANDRA-14930
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14930
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Coordination, Legacy/Core
>Reporter: Zhao Yang
>Assignee: Stefan Miklosovic
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
>
> On a 3-node cluster with RF=2, decommissioning a node may cause a quorum write 
> timeout because the messaging backlog to the decommissioned node is cleared via 
> {{Gossiper#removeEndpoint() -> OutboundTcpConnection#closeSocket()}}.
>  (A timeout is less likely with RF=3, because quorum then needs only 2 of the 
> 3 replicas, so the write can afford one missing response.)
> {code:java}
> What happened:
> 1. [WriteStage] before the leaving node is removed from tokenmetadata, the 
> write endpoints are generated (the leaving endpoint is included)
> 2. [GossipStage] the leaving node is removed from tokenmetadata; no 
> future write handler will include the leaving endpoint
> 3. [WriteStage] write handlers send messages to the messaging-service backlog
> 4. [GossipStage] the messaging-service backlog is cleared; the messages are not 
> sent and the connection is closed
> 5. [WriteStage] the write times out
>  {code}
> |patch|
> |[3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.0]|
> |[3.11|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.11]|
> We can avoid this by delaying destruction of the messaging connection so that 
> queued messages are sent and responded to. This patch also avoids reopening an 
> already closed connection in {{MessagingService#convict()}}.
>  The new messaging framework rewrite in {{Trunk}} avoids the issue by not 
> clearing the messaging backlog.
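
The difference between clearing the backlog and letting it drain before the connection is closed can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyConnection`, `closeImmediately`, `drainThenClose` and `delivered` are hypothetical names, not Cassandra classes or methods, and the sketch drains synchronously where the patch is described as delaying the destruction of the connection.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Toy model of an outbound connection with a message backlog (hypothetical, not Cassandra code). */
class BacklogCloseSketch {

    static final class ToyConnection {
        private final Queue<String> backlog = new ArrayDeque<>();
        private final List<String> sent = new ArrayList<>();
        private boolean closed;

        /** Queue a message; a closed connection rejects it and is not reopened. */
        boolean enqueue(String message) {
            if (closed) return false;
            backlog.add(message);
            return true;
        }

        /** Behaviour described in the report: drop the backlog, then close. */
        void closeImmediately() {
            backlog.clear();
            closed = true;
        }

        /** Direction of the proposed fix: send what is queued, then close. */
        void drainThenClose() {
            while (!backlog.isEmpty()) sent.add(backlog.poll());
            closed = true;
        }

        List<String> sent() { return sent; }
    }

    /** Number of queued writes that reach the leaving node under each strategy. */
    static int delivered(boolean drainFirst) {
        ToyConnection c = new ToyConnection();
        c.enqueue("write-1");                 // step 3: write handlers queue messages
        c.enqueue("write-2");
        if (drainFirst) c.drainThenClose();   // step 4, with the backlog drained first
        else c.closeImmediately();            // step 4, as reported
        c.enqueue("write-3");                 // late message: rejected, connection stays closed
        return c.sent().size();
    }

    public static void main(String[] args) {
        System.out.println("clear then close: " + delivered(false)); // prints 0
        System.out.println("drain then close: " + delivered(true));  // prints 2
    }
}
```

In the first case no queued write reaches the leaving node, which corresponds to the timeout in step 5 when the coordinator needs that response. In the second case both queued writes are sent before the connection goes away.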



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2022-12-06 Thread Stefan Miklosovic (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Miklosovic updated CASSANDRA-14930:
--
Status: In Progress  (was: Patch Available)




[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2022-12-01 Thread Stefan Miklosovic (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stefan Miklosovic updated CASSANDRA-14930:
--
Change Category: Operability
 Complexity: Normal
  Reviewers: Aleksei Zotov, Brandon Williams, Stefan Miklosovic  (was: 
Aleksei Zotov, Brandon Williams)
 Status: Review In Progress  (was: Needs Committer)

> decommission may cause timeout because messaging backlog is cleared 
> 
>
> Key: CASSANDRA-14930
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14930
> Project: Cassandra
>  Issue Type: Bug
>  Components: Legacy/Coordination, Legacy/Core
>Reporter: Zhao Yang
>Assignee: Zhao Yang
>Priority: Normal
> Fix For: 3.0.x, 3.11.x
>
>



[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2021-09-21 Thread Aleksei Zotov (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksei Zotov updated CASSANDRA-14930:
--
Reviewers: Aleksei Zotov, Brandon Williams  (was: Brandon Williams)




[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2021-07-12 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-14930:
-
Status: Needs Reviewer  (was: Review In Progress)




[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2021-06-28 Thread Brandon Williams (Jira)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Brandon Williams updated CASSANDRA-14930:
-
Reviewers: Brandon Williams, Brandon Williams  (was: Brandon Williams)
   Status: Review In Progress  (was: Patch Available)




[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2018-12-14 Thread ZhaoYang (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-14930:
-
Status: Patch Available  (was: Open)

> decommission may cause timeout because messaging backlog is cleared 
> 
>
> Key: CASSANDRA-14930
> URL: https://issues.apache.org/jira/browse/CASSANDRA-14930
> Project: Cassandra
>  Issue Type: Bug
>  Components: Coordination, Core
>Reporter: ZhaoYang
>Assignee: ZhaoYang
>Priority: Major
> Fix For: 3.0.x, 3.11.x
>
>



[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2018-12-11 Thread ZhaoYang (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-14930:
-
Description: 
On a 3-node cluster with RF=2, decommissioning a node may cause quorum write 
timeout because messaging backlog to decommissioned node is cleared via 
{{Gossiper#removeEndpoint() -> OutboundTcpConnection#closeSocket()}}.
(Timeout is less likely to happen with RF=3, because we can afford one less 
response)

{code:java}
What happened:
1. [WriteStage] before the leaving node is removed from tokenmetadata, the 
write endpoints are generated ( leaving endpoint is included )
2. [GossipStage] the leaving node is removed from tokenmetadata, no more future 
write handler will include leaving endpoints
3. [WriteStage] write handlers sends messages to messaging-service backlog
4. [GossipStage] messaging-service backlog is cleared, messages are not sent 
and connection closed
5. [WriteStage] write time out
 {code}


| patch |
| [3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.0] |
| [3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.11] |

We can avoid it by delaying to destroy messaging connection so that messages 
are sent and responded. This patch also avoids reopen already closed connection 
on {{MessagingService#convict()}}.
New messaging framework rewrite in {{Trunk}} avoids the issues by not clearing 
messaging backlog.


  was:
On a 3-node cluster with RF=2, decommissioning a node may cause quorum write 
timeout because messaging backlog to decommissioned node is cleared via 
{{Gossiper#removeEndpoint() -> OutboundTcpConnection#closeSocket()}}.
(Timeout is less likely to happen with RF=3, because we can afford one less 
response)

{code:java}
What happened:
1. [WriteStage] before the leaving node is removed from tokenmetadata, the 
write endpoints are generated ( leaving endpoint is included )
2. [GossipStage] the leaving node is removed from tokenmetadata, no more future 
write handler will include leaving endpoints
3. [WriteStage] write handlers sends messages to messaging-service backlog
4. [GossipStage] messaging-service backlog is cleared, messages are not sent 
and connection closed
5. [WriteStage] write time out
 {code}


| patch |
| [3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.0] |
| [3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.11] |

We can avoid it by delaying to destroy messaging connection so that messages 
are sent and responded. New messaging framework rewrite in {{Trunk}} avoids the 
issues by not clearing messaging backlog.






[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2018-12-11 Thread ZhaoYang (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-14930:
-
Description: 
On a 3-node cluster with RF=2, decommissioning a node may cause quorum write 
timeout because messaging backlog to decommissioned node is cleared via 
{{Gossiper#removeEndpoint() -> OutboundTcpConnection#closeSocket()}}.
 (Timeout is less likely to happen with RF=3, because we can afford one less 
response)
{code:java}
What happened:
1. [WriteStage] before the leaving node is removed from tokenmetadata, the 
write endpoints are generated ( leaving endpoint is included )
2. [GossipStage] the leaving node is removed from tokenmetadata, no more future 
write handler will include leaving endpoints
3. [WriteStage] write handlers sends messages to messaging-service backlog
4. [GossipStage] messaging-service backlog is cleared, messages are not sent 
and connection closed
5. [WriteStage] write time out
 {code}
|patch|
|[3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.0]|
|[3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.11]|

We can avoid it by delaying to destroy messaging connection so that messages 
are sent and responded. This patch also avoids reopening already closed 
connection on {{MessagingService#convict()}}.
 New messaging framework rewrite in {{Trunk}} avoids the issues by not clearing 
messaging backlog.

  was:
On a 3-node cluster with RF=2, decommissioning a node may cause quorum write 
timeout because messaging backlog to decommissioned node is cleared via 
{{Gossiper#removeEndpoint() -> OutboundTcpConnection#closeSocket()}}.
(Timeout is less likely to happen with RF=3, because we can afford one less 
response)

{code:java}
What happened:
1. [WriteStage] before the leaving node is removed from tokenmetadata, the 
write endpoints are generated ( leaving endpoint is included )
2. [GossipStage] the leaving node is removed from tokenmetadata, no more future 
write handler will include leaving endpoints
3. [WriteStage] write handlers sends messages to messaging-service backlog
4. [GossipStage] messaging-service backlog is cleared, messages are not sent 
and connection closed
5. [WriteStage] write time out
 {code}


| patch |
| [3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.0] |
| [3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.11] |

We can avoid it by delaying to destroy messaging connection so that messages 
are sent and responded. This patch also avoids reopen already closed connection 
on {{MessagingService#convict()}}.
New messaging framework rewrite in {{Trunk}} avoids the issues by not clearing 
messaging backlog.






[jira] [Updated] (CASSANDRA-14930) decommission may cause timeout because messaging backlog is cleared

2018-12-11 Thread ZhaoYang (JIRA)


 [ 
https://issues.apache.org/jira/browse/CASSANDRA-14930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ZhaoYang updated CASSANDRA-14930:
-
Description: 
On a 3-node cluster with RF=2, decommissioning a node may cause quorum write 
timeout because messaging backlog to decommissioned node is cleared via 
{{Gossiper#removeEndpoint() -> OutboundTcpConnection#closeSocket()}}.
 (Timeout is less likely to happen with RF=3, because we can afford one less 
response)
{code:java}
What happened:
1. [WriteStage] before the leaving node is removed from tokenmetadata, the 
write endpoints are generated ( leaving endpoint is included )
2. [GossipStage] the leaving node is removed from tokenmetadata, no more future 
write handler will include leaving endpoints
3. [WriteStage] write handlers sends messages to messaging-service backlog
4. [GossipStage] messaging-service backlog is cleared, messages are not sent 
and connection closed
5. [WriteStage] write time out
 {code}
|patch|
|[3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.0]|
|[3.11|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.11]|

We can avoid it by delaying to destroy messaging connection so that messages 
are sent and responded. This patch also avoids reopening already closed 
connection on {{MessagingService#convict()}}.
 New messaging framework rewrite in {{Trunk}} avoids the issues by not clearing 
messaging backlog.

  was:
On a 3-node cluster with RF=2, decommissioning a node may cause quorum write 
timeout because messaging backlog to decommissioned node is cleared via 
{{Gossiper#removeEndpoint() -> OutboundTcpConnection#closeSocket()}}.
 (Timeout is less likely to happen with RF=3, because we can afford one less 
response)
{code:java}
What happened:
1. [WriteStage] before the leaving node is removed from tokenmetadata, the 
write endpoints are generated ( leaving endpoint is included )
2. [GossipStage] the leaving node is removed from tokenmetadata, no more future 
write handler will include leaving endpoints
3. [WriteStage] write handlers sends messages to messaging-service backlog
4. [GossipStage] messaging-service backlog is cleared, messages are not sent 
and connection closed
5. [WriteStage] write time out
 {code}
|patch|
|[3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.0]|
|[3.0|https://github.com/jasonstack/cassandra/commits/decommission_timeout_3.11]|

We can avoid it by delaying to destroy messaging connection so that messages 
are sent and responded. This patch also avoids reopening already closed 
connection on {{MessagingService#convict()}}.
 New messaging framework rewrite in {{Trunk}} avoids the issues by not clearing 
messaging backlog.

