[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses

2015-05-21 Thread Erik Forsberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14554143#comment-14554143
 ] 

Erik Forsberg commented on CASSANDRA-9183:
--

This patch applies cleanly on 2.0 and has greatly improved my cluster's 
stability. If you would consider including it in 2.0, that would be great.

 Failure detector should detect and ignore local pauses
 --

 Key: CASSANDRA-9183
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.2.0 beta 1, 2.1.6

 Attachments: 9183-v2.txt, 9183.txt


 A local node can be paused for many reasons, such as GC. If the pause is long 
 enough, then when the node recovers it will think all the other nodes are dead 
 until it gossips, causing UnavailableException (UAE) to be thrown to clients 
 trying to use it as a coordinator.  Instead, the failure detector (FD) can 
 track the current time and, if the gap since its last check becomes too large, 
 skip marking the nodes down (or perhaps reset the FD data).
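
The idea in the description can be sketched roughly as follows. This is a minimal illustration, not the attached patch: the class name LocalPauseGuard, the millisecond threshold, and the looksDead flag (standing in for whatever the detector would normally conclude) are all assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a pause-aware check; not Cassandra's FailureDetector. */
class LocalPauseGuard {
    private final long maxLocalPauseMillis;   // assumed threshold for "too large" a gap
    private long lastInterpretMillis;         // when interpret() last ran
    private final Set<String> downEndpoints = new HashSet<>();

    LocalPauseGuard(long maxLocalPauseMillis, long startMillis) {
        this.maxLocalPauseMillis = maxLocalPauseMillis;
        this.lastInterpretMillis = startMillis;
    }

    /**
     * @param looksDead what an ordinary detector would conclude about the endpoint
     * @return true if the endpoint was marked down by this call
     */
    boolean interpret(String endpoint, boolean looksDead, long nowMillis) {
        long gap = nowMillis - lastInterpretMillis;
        lastInterpretMillis = nowMillis;
        if (gap > maxLocalPauseMillis) {
            // The local node was probably paused, so the evidence is stale: skip.
            return false;
        }
        if (looksDead) {
            downEndpoints.add(endpoint);
            return true;
        }
        return false;
    }

    boolean isDown(String endpoint) {
        return downEndpoints.contains(endpoint);
    }
}
```

Because this sketch keeps a single shared timestamp, only the first call after a pause sees the large gap; calls for other endpoints that follow immediately are not protected by it.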



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses

2015-05-20 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553102#comment-14553102
 ] 

Brandon Williams commented on CASSANDRA-9183:
-

wasPaused was added simply to survive two rounds of interpret() on the same 
endpoint, but wasn't intended to cross endpoints at all.  That said, I think 
you're right and instead we'd have to do something like track it per-endpoint.  
Can you make a new ticket for this?
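
One possible reading of "track it per-endpoint" is sketched below. It is only an assumption about the shape such a change could take, not what the follow-up ticket implemented: the class name PerEndpointPauseTracker, the millisecond threshold, and the looksDead flag are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: remember the last interpret time separately per endpoint. */
class PerEndpointPauseTracker {
    private final long maxLocalPauseMillis;   // assumed pause threshold
    private final Map<String, Long> lastInterpretMillis = new HashMap<>();

    PerEndpointPauseTracker(long maxLocalPauseMillis) {
        this.maxLocalPauseMillis = maxLocalPauseMillis;
    }

    /**
     * @param looksDead what an ordinary detector would conclude about the endpoint
     * @return true if the endpoint should be marked down in this round
     */
    boolean interpret(String endpoint, boolean looksDead, long nowMillis) {
        // put() returns the previous timestamp for this endpoint, or null on first use.
        Long previous = lastInterpretMillis.put(endpoint, nowMillis);
        if (previous != null && nowMillis - previous > maxLocalPauseMillis) {
            // This endpoint has not been checked since before the pause: skip it.
            return false;
        }
        return looksDead;
    }
}
```

With a timestamp per endpoint, every endpoint's first check after a pause sees the large gap, so none of them is marked down in that round.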



[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses

2015-05-20 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553097#comment-14553097
 ] 

sankalp kohli commented on CASSANDRA-9183:
--

[~brandon.williams] In the interpret method, I can see that you would not mark 
2 endpoints as down due to the way you are using the wasPaused variable. Why 
is that? 
The third endpoint will be marked as down after the pause.  



[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses

2015-05-20 Thread sankalp kohli (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553128#comment-14553128
 ] 

sankalp kohli commented on CASSANDRA-9183:
--

Created CASSANDRA-9446



[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses

2015-05-12 Thread Brandon Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541016#comment-14541016
 ] 

Brandon Williams commented on CASSANDRA-9183:
-

Done.

 Failure detector should detect and ignore local pauses
 --

 Key: CASSANDRA-9183
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.2 beta 1, 2.1.6

 Attachments: 9183-v2.txt, 9183.txt




[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses

2015-05-12 Thread Richard Low (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541008#comment-14541008
 ] 

Richard Low commented on CASSANDRA-9183:


Is it possible to get this in 2.1 too?

 Failure detector should detect and ignore local pauses
 --

 Key: CASSANDRA-9183
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 2.2 beta 1

 Attachments: 9183-v2.txt, 9183.txt




[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses

2015-05-07 Thread Richard Low (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533030#comment-14533030
 ] 

Richard Low commented on CASSANDRA-9183:


+1. Very minor comment: it would be slightly clearer to set lastInterpret 
immediately after the diff calculation rather than in both cases.
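
The suggestion can be illustrated with the sketch below, which assumes the reviewed code assigned lastInterpret in both the paused and the non-paused branch. The class name, method names, and MAX_PAUSE value are hypothetical and exist only to contrast the two forms.

```java
/** Hypothetical sketch contrasting two placements of the lastInterpret update. */
class InterpretOrdering {
    static final long MAX_PAUSE = 5000L;   // assumed threshold, in milliseconds
    long lastInterpret;                    // time of the previous check

    /** Form being commented on (as assumed here): update in both branches. */
    boolean pausedAssignInBothBranches(long now) {
        long diff = now - lastInterpret;
        if (diff > MAX_PAUSE) {
            lastInterpret = now;
            return true;
        } else {
            lastInterpret = now;
            return false;
        }
    }

    /** Suggested form: update once, immediately after computing the diff. */
    boolean pausedAssignOnce(long now) {
        long diff = now - lastInterpret;
        lastInterpret = now;
        return diff > MAX_PAUSE;
    }
}
```

Both methods return the same result and leave lastInterpret in the same state; the second simply makes it obvious that the update is unconditional.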

 Failure detector should detect and ignore local pauses
 --

 Key: CASSANDRA-9183
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 3.x

 Attachments: 9183-v2.txt, 9183.txt




[jira] [Commented] (CASSANDRA-9183) Failure detector should detect and ignore local pauses

2015-04-21 Thread Erik Forsberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14504921#comment-14504921
 ] 

Erik Forsberg commented on CASSANDRA-9183:
--

As CASSANDRA-9218 was closed as a duplicate of this, I would like to add that 
I'm seeing a behaviour where the node that had a pause never recovers. You need 
to restart parts of your cluster to make it recover, because gossip is waiting 
for an echo reply that never comes back: network packets were dropped during 
the pause.

 Failure detector should detect and ignore local pauses
 --

 Key: CASSANDRA-9183
 URL: https://issues.apache.org/jira/browse/CASSANDRA-9183
 Project: Cassandra
  Issue Type: Improvement
  Components: Core
Reporter: Brandon Williams
Assignee: Brandon Williams
 Fix For: 3.0

 Attachments: 9183-v2.txt, 9183.txt

