Re: Issue replacing a dead node

Courtney Thu, 15 May 2025 17:14:33 -0700

I checked all the logs and really couldn't find anything. I couldn't find any sort of errors in dmesg, system.log, debug.log, gc.log (maybe up the log level?), systemd journal...the logs are totally clean. It just stops gossiping all of a sudden at 22GB of data each time, then the old node returns to DN state. What is `nodetool bootstrap resume` going to do? Is there a risk to running resume when the replacement node is no longer in the cluster? Could too high of a tombstone ratio cause this?


On 5/15/25 5:08 PM, Bowen Song via user wrote:

The dead node being replaced went back to DN state, indicating the new replacement node failed to join the cluster, usually because the streaming was interrupted (e.g. by network issues, or long STW GC pauses). I would start looking for red flags in the logs, including Cassandra's logs, GC logs, dmesg, systemd journal, etc., on the new node, and other nodes in the cluster too. Also, I would try `nodetool bootstrap resume` on the replacement node.
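
A minimal sketch of the GC part of that log check, assuming Cassandra's GCInspector writes pause lines of the form `... GC in 1234ms.` to system.log. The function name, the 1000 ms default threshold, and the log path in the usage note are illustrative assumptions, not something stated in this thread:

```shell
# Print GCInspector lines whose reported pause is at least min_ms milliseconds.
# Assumes each pause line contains "GC in <N>ms"; adjust the pattern if your
# log format differs.
long_gc_pauses() {
    log_file="$1"
    min_ms="${2:-1000}"
    awk -v min="$min_ms" '
        /GCInspector/ && match($0, /GC in [0-9]+ms/) {
            # "GC in " is 6 characters and "ms" is 2, so the digits sit between them.
            ms = substr($0, RSTART + 6, RLENGTH - 8)
            if (ms + 0 >= min) print
        }
    ' "$log_file"
}
```

For example, `long_gc_pauses /var/log/cassandra/system.log 1000` (the path is an assumption for package installs) would list pauses of one second or longer, which can then be compared against the time gossip stopped.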
On 12/05/2025 09:53, Courtney wrote:
Hello everyone,
I have a cluster with 2 datacenters. I am using GossipingPropertyFileSnitch as my endpoint snitch. Cassandra version 4.1.8. One datacenter is fully Ubuntu 24.04 and OpenJDK 11 and another is Ubuntu 20.04 on OpenJDK 8. A seed node died in my second DC running Ubuntu 20.04 hosts. I ordered a new dedicated server. I updated my seeds to forget the dead seed node. I did the steps to replace a dead node:
JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS -Dcassandra.replace_address_first_boot=<dead_node_ip>"
Configs between the old/new node are identical minus IP addresses and that line above in the env file to replace the dead node. I started the node and it started replacing the old node and was in the `UJ` state. Not long into the process, the new node stops processing data and the cluster forgets the new node and remembers the old one in its `DN` state (which is turned off, no power). There are no errors in the logs. I've tried different times hoping to solve the issue. I upped my ROOT logging level to DEBUG, I also set "org.apache.cassandra.gms.Gossiper TRACE". No errors.
With TRACE set for the Gossiper, I notice gossiping stops and data stops streaming at about the same time. I cannot run any nodetool commands on the new node. The process doesn't die, it leaves open connections to nodes that are streaming data, but I don't see any data streaming.
I've thought through a lot. Space isn't an issue, ulimits are set high in /etc/security/limits.conf. Checking /proc/<pid>/limits shows the values are high. I've replaced nodes before like this without issue, but this one is causing me grief. Is there anything more I can do?
Courtney
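
The limits check mentioned in the original message can be sketched roughly as follows. This is a Linux-only illustration; the function name and the choice of rows to display are assumptions, not something taken from the thread:

```shell
# Print selected resource limits of a running process from /proc/<pid>/limits.
# Linux-only; the rows selected here are an illustrative choice.
show_limits() {
    pid="$1"
    grep -E '^Max (open files|processes|locked memory|address space)' "/proc/$pid/limits"
}
```

Assuming a single Cassandra process on the host, something like `show_limits "$(pgrep -f CassandraDaemon)"` would show the limits the running JVM actually has, which can differ from what /etc/security/limits.conf specifies when the service is started by systemd.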
