[ 
https://issues.apache.org/jira/browse/CASSANDRA-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis resolved CASSANDRA-9218.
---------------------------------------
    Resolution: Duplicate

> Node thinks other nodes are down after heavy GC
> -----------------------------------------------
>
>                 Key: CASSANDRA-9218
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9218
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Erik Forsberg
>
> I have a few troublesome nodes which often end up doing very long GC pauses. 
> The root cause of this is yet to be found, but it's causing another problem - 
> the affected node(s) mark other nodes as down, and they never recover.
> Here's how it goes:
> 1. Node goes into troublesome mode, doing heavy GC with long (10+ seconds) GC 
> pauses.
> 2. While this happens, node will mark other nodes as down.
> 3. Once the overload situation resolves, the node still thinks the other 
> nodes are down (they are not). It's also quite common that other nodes think 
> the affected node is down.
> So we often end up with node A thinking some 30 nodes are down, and a 
> bunch of other nodes believing node A is down. This is in a cluster of 56 
> nodes. 
> The only way to get out of the situation is to restart node A, and sometimes 
> a few other nodes. And while node A is in this state, any queries that use 
> node A as coordinator have a high risk of getting errors about not enough 
> replicas being available. 
> I have enabled TRACE level gossip debugging while this happens, and on node 
> A there are multiple messages saying "has already a pending echo, 
> skipping it", i.e. the debug line in Gossiper.java line 882.
> I have also observed while this was happening that other nodes were trying to 
> establish connections (SYN packets sent) but the troubled node (A) was not 
> picking up the line (no accept()).
> I don't know exactly how Gossiper works here, but it looks like node A is 
> sending out some gossiper echo messages, is then too busy to receive the 
> replies, and never retries.
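
The hypothesis in the last paragraph can be illustrated with a small toy model. This is only a sketch of the suspected failure mode, not Cassandra's Gossiper code: the class EchoTracker, the function simulate, the timeout parameter and the 15-tick pause are all hypothetical names and values chosen for illustration. It assumes that an echo is skipped while an earlier one is still marked pending, and that a reply lost during the GC pause never clears that mark.

```python
class EchoTracker:
    """Toy model of a 'pending echo' guard (hypothetical, not Cassandra code)."""

    def __init__(self, timeout=None):
        self.timeout = timeout  # ticks before a pending echo is abandoned; None = never
        self.pending = {}       # endpoint -> tick at which the echo was sent
        self.alive = set()      # endpoints confirmed alive by an echo reply

    def maybe_send_echo(self, endpoint, now):
        """Return True if an echo is sent, False if it is skipped."""
        sent_at = self.pending.get(endpoint)
        if sent_at is not None:
            expired = self.timeout is not None and now - sent_at >= self.timeout
            if not expired:
                return False    # an echo is already pending, so skip this one
        self.pending[endpoint] = now
        return True

    def on_echo_reply(self, endpoint):
        """A reply clears the pending mark and marks the endpoint alive."""
        if self.pending.pop(endpoint, None) is not None:
            self.alive.add(endpoint)


def simulate(timeout, rounds=60, pause_ends_at=15):
    """Node A tries to echo node B once per tick; replies before
    pause_ends_at are lost (A is stuck in GC). Returns True if A ends up
    considering B alive."""
    tracker = EchoTracker(timeout)
    for now in range(rounds):
        if "B" in tracker.alive:
            break
        sent = tracker.maybe_send_echo("B", now)
        if sent and now >= pause_ends_at:
            tracker.on_echo_reply("B")
    return "B" in tracker.alive


if __name__ == "__main__":
    print("no timeout:", simulate(timeout=None))
    print("timeout of 10 ticks:", simulate(timeout=10))
```

In this model, without a timeout the first lost reply leaves B pending forever, so every later attempt is skipped and B is never marked alive again. With a timeout the stale entry expires and a later echo succeeds once the pause is over. Whether the real Gossiper behaves this way is not confirmed by this report.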



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
