[ https://issues.apache.org/jira/browse/CASSANDRA-9218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Ellis resolved CASSANDRA-9218.
---------------------------------------
    Resolution: Duplicate

> Node thinks other nodes are down after heavy GC
> -----------------------------------------------
>
>                 Key: CASSANDRA-9218
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9218
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: Erik Forsberg
>
> I have a few troublesome nodes which often end up doing very long GC pauses.
> The root cause of this is yet to be found, but it's causing another problem -
> the affected node(s) mark other nodes as down, and they never recover.
> Here's how it goes:
> 1. Node goes into troublesome mode, doing heavy GC with long (10+ seconds) GC
> pauses.
> 2. While this happens, the node will mark other nodes as down.
> 3. Once the overload situation resolves, the node still thinks the other
> nodes are down (they are not). It's also quite common that other nodes think
> the affected node is down.
> So we often end up with node A thinking there are some 30 nodes down, and a
> bunch of other nodes believing node A is down. This is in a cluster with 56
> nodes.
> The only way to get out of the situation is to restart node A, and sometimes
> a few other nodes. And while node A is in this state, any queries that use
> node A as coordinator have a high risk of getting errors about not enough
> replicas being available.
> I have enabled TRACE level gossip debugging while this happens, and on node
> A, there will be multiple messages about, "has already a pending echo,
> skipping it" - i.e. the debug line in Gossiper.java line 882.
> I have also observed while this was happening that other nodes were trying to
> establish connections (SYN packets sent) but the troubled node (A) was not
> picking up the line (no accept()).
> I don't know exactly how Gossiper works here, but it looks like node A is
> sending out some gossiper echo messages, but then is too busy to get the
> replies, and never retries.
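
The suspected failure mode in the last paragraph (an echo is marked as pending, the reply is never processed, and every later attempt is skipped) can be illustrated with a small sketch. This is not Cassandra's Gossiper code; the class `EchoTracker`, its methods, and the `timeout_s` parameter are hypothetical names chosen for illustration. The sketch assumes that letting a pending echo expire after a timeout would allow a retry once the GC pause is over.

```python
import time
from typing import Callable, Dict


class EchoTracker:
    """Tracks outstanding echo requests per peer.

    Hypothetical helper for illustration only, not part of Cassandra.
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        # peer address -> time at which the last echo was sent
        self.pending: Dict[str, float] = {}

    def should_send_echo(self, peer: str) -> bool:
        """Return True if a new echo should be sent to the peer."""
        now = self.clock()
        sent_at = self.pending.get(peer)
        if sent_at is not None and now - sent_at < self.timeout_s:
            # An echo is still considered in flight: skip, as in the
            # "has already a pending echo, skipping it" log message.
            return False
        # No pending echo, or the old one expired: record a new send.
        self.pending[peer] = now
        return True

    def on_echo_reply(self, peer: str) -> None:
        """Clear the pending marker once the reply is processed."""
        self.pending.pop(peer, None)
```

Without the expiry check in `should_send_echo`, a reply lost during a long pause would leave the peer in `pending` forever, which matches the behaviour described above where node A never retries until it is restarted.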