[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554143#comment-14554143 ]

Erik Forsberg commented on CASSANDRA-9183:
------------------------------------------

This patch applies cleanly on 2.0 and has greatly increased my cluster stability. So if you would consider inclusion into 2.0 that would be great.

Failure detector should detect and ignore local pauses
------------------------------------------------------

        Key: CASSANDRA-9183
        URL: https://issues.apache.org/jira/browse/CASSANDRA-9183
    Project: Cassandra
 Issue Type: Improvement
 Components: Core
   Reporter: Brandon Williams
   Assignee: Brandon Williams
    Fix For: 2.2.0 beta 1, 2.1.6
Attachments: 9183-v2.txt, 9183.txt

A local node can be paused for many reasons such as GC, and if the pause is long enough, when it recovers it will think all the other nodes are dead until it gossips, causing UAE to be thrown to clients trying to use it as a coordinator. Instead, the FD can track the current time, and if the gap there becomes too large, skip marking the nodes down (reset the FD data perhaps).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
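The approach in the issue description (track the current time and skip marking nodes down when the gap is too large) might look roughly like the Java sketch below. It is not the code from 9183.txt or 9183-v2.txt. The class name PauseAwareDetector, the method shouldEvaluate, and the maxLocalPauseNanos threshold are hypothetical names chosen for illustration; only lastInterpret is a name mentioned in this thread. Timestamps are passed in as arguments so the logic is easy to exercise; a real caller would presumably supply System.nanoTime().

```java
/**
 * Hypothetical sketch of local pause detection. Not the actual Cassandra patch.
 * Single-threaded; a real failure detector would need to consider concurrency.
 */
class PauseAwareDetector {
    // Assumed threshold: a gap longer than this between two checks is
    // treated as a local pause (for example a long GC).
    private final long maxLocalPauseNanos;

    // Time of the previous check, in nanoseconds.
    private long lastInterpretNanos;

    PauseAwareDetector(long maxLocalPauseNanos, long startNanos) {
        this.maxLocalPauseNanos = maxLocalPauseNanos;
        this.lastInterpretNanos = startNanos;
    }

    /**
     * Returns true if the caller may go on to judge remote endpoints,
     * false if the local node appears to have been paused and the
     * round should be skipped.
     */
    boolean shouldEvaluate(long nowNanos) {
        long diff = nowNanos - lastInterpretNanos;
        // Update the timestamp once, right after computing the gap,
        // so it is not repeated in each branch.
        lastInterpretNanos = nowNanos;
        return diff <= maxLocalPauseNanos;
    }
}
```

With a 5 second threshold, a check 1 second after the previous one is evaluated normally, while a check that arrives 9 seconds after the previous one is skipped and the next timely check is evaluated again.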
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553102#comment-14553102 ]

Brandon Williams commented on CASSANDRA-9183:
---------------------------------------------

wasPaused was added simply to survive two rounds of interpret() on the same endpoint, but wasn't intended to cross endpoints at all. That said, I think you're right and instead we'd have to do something like track it per-endpoint. Can you make a new ticket for this?
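One way to read "track it per-endpoint" is to give every known endpoint its own budget of skipped rounds once a pause is detected, instead of sharing a single flag. The Java sketch below illustrates that idea only. It is not the fix that was eventually applied, and every name in it (PerEndpointPauseTracker, register, shouldEvaluate, roundsToSkip, maxLocalPauseNanos) is hypothetical. Endpoints are represented as plain strings to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch of per-endpoint pause tracking. Not Cassandra code.
 * Single-threaded for simplicity.
 */
class PerEndpointPauseTracker {
    private final long maxLocalPauseNanos; // assumed pause threshold
    private final int roundsToSkip;        // assumed skip budget per endpoint
    private final Map<String, Integer> skipsLeft = new HashMap<>();
    private long lastInterpretNanos;

    PerEndpointPauseTracker(long maxLocalPauseNanos, int roundsToSkip, long startNanos) {
        this.maxLocalPauseNanos = maxLocalPauseNanos;
        this.roundsToSkip = roundsToSkip;
        this.lastInterpretNanos = startNanos;
    }

    /** Makes an endpoint known so that it receives a skip budget after a pause. */
    void register(String endpoint) {
        skipsLeft.putIfAbsent(endpoint, 0);
    }

    /** Returns false while the given endpoint still has skipped rounds left. */
    boolean shouldEvaluate(String endpoint, long nowNanos) {
        skipsLeft.putIfAbsent(endpoint, 0);
        long diff = nowNanos - lastInterpretNanos;
        lastInterpretNanos = nowNanos;
        if (diff > maxLocalPauseNanos) {
            // A local pause was detected: every known endpoint gets its own
            // budget, so the outcome does not depend on evaluation order.
            skipsLeft.replaceAll((ep, left) -> roundsToSkip);
        }
        int left = skipsLeft.get(endpoint);
        if (left > 0) {
            skipsLeft.put(endpoint, left - 1);
            return false;
        }
        return true;
    }
}
```

With three registered endpoints and a budget of one round, the first round after a pause is skipped for all three, including the third endpoint, and normal evaluation resumes on the following round.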
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553097#comment-14553097 ]

sankalp kohli commented on CASSANDRA-9183:
------------------------------------------

[~brandon.williams] In the interpret method, I can see that you would not mark 2 endpoints as down due to the way you are using the wasPaused variable. Why is that? The third endpoint will be marked as down after the pause.
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553128#comment-14553128 ]

sankalp kohli commented on CASSANDRA-9183:
------------------------------------------

Created CASSANDRA-9446
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541016#comment-14541016 ]

Brandon Williams commented on CASSANDRA-9183:
---------------------------------------------

Done.

Fix For: 2.2 beta 1, 2.1.6
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541008#comment-14541008 ]

Richard Low commented on CASSANDRA-9183:
----------------------------------------

Is it possible to get this in 2.1 too?

Fix For: 2.2 beta 1
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533030#comment-14533030 ]

Richard Low commented on CASSANDRA-9183:
----------------------------------------

+1. Very minor comment: it would be slightly clearer to set lastInterpret immediately after the diff calculation rather than in both cases.

Fix For: 3.x
[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses
[ https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504921#comment-14504921 ]

Erik Forsberg commented on CASSANDRA-9183:
------------------------------------------

As CASSANDRA-9218 was closed as a duplicate of this issue, I would like to add that I'm seeing a behaviour where the node that had a pause never recovers. You need to restart parts of your cluster to make it recover, because gossip is waiting for an echo reply that never comes back: network packets were dropped during the pause.

Fix For: 3.0