Is it bad to leave the replacement node up and running for hours even
after the cluster has forgotten it and gone back to showing the old
node being replaced? I'll have to set the logging to TRACE; DEBUG
produced nothing. I did stop the service, which produced errors on the
other nodes in the datacenter since they still had open connections to
the server. The hardware is new. A disk issue this early would be odd,
though perhaps not impossible.
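For the next attempt I'm thinking of something along these lines to get the extra detail (a runtime change via nodetool; the logger names are my best guess at the relevant packages, and the same entries could go into logback.xml instead):

# raise the gossip and streaming loggers at runtime; revert with INFO afterwards
nodetool setlogginglevel org.apache.cassandra.gms.Gossiper TRACE
nodetool setlogginglevel org.apache.cassandra.streaming TRACE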
On 5/16/25 7:27 PM, Bowen Song via user wrote:
In my experience, a failed bootstrap / node replacement always leaves
some traces in the logs. At the very least, there will be log entries
about streaming sessions failing or aborting. I have never seen it fail
or stop silently without leaving any trace in the log, and I can't
think of anything that could cause the process to fail without doing
so. BTW, the relevant log entries can appear hours before the symptom
becomes visible, because a failed streaming session does not cause
Cassandra to immediately abort the other active streaming sessions,
and the remaining active sessions can take a while to complete.
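For example, a rough search along these lines (log paths assume the default package install; adjust to your setup) usually turns up the failed or aborted sessions, on both the replacement node and the nodes streaming to it:

# look a few hours back from the time the node dropped out of the ring
grep -iE 'stream|abort|fail' /var/log/cassandra/system.log /var/log/cassandra/debug.log | less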
If the process repeatedly fails at the same place, I would suspect
some sort of data corruption or disk error, resulting in data that
cannot be read or deserialised correctly. But this is just a guess,
and I could be wrong.
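If you want to rule that out, some rough checks along these lines would be a starting point (the device name is a placeholder, smartctl needs smartmontools installed, and nodetool verify can be I/O heavy, so treat this as a sketch):

# kernel-level disk errors on the nodes involved
dmesg -T | grep -iE 'i/o error|ata|nvme|medium error'
# SMART health of the drive (adjust the device name)
smartctl -a /dev/nvme0n1
# checksum-verify the sstables of a keyspace on one of the source nodes
nodetool verify --extended-verify <keyspace>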
On 16/05/2025 01:14, Courtney wrote:
I checked all the logs and really couldn't find anything. I couldn't
find any errors in dmesg, system.log, debug.log, gc.log (maybe I
should raise the log level?), or the systemd journal... the logs are
totally clean. It just stops gossiping all of a sudden at 22GB of data
each time, and then the old node returns to the DN state. What is
`nodetool bootstrap resume` going to do? Is there a risk in running
resume when the replacement node is no longer in the cluster? Could
too high a tombstone ratio cause this?
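If it's worth checking, I assume something like this on one of the existing nodes would show the per-table tombstone figures (the keyspace name is a placeholder):

# look at the tombstone columns reported for each table
nodetool tablestats <keyspace> | grep -iE 'table:|tombstones'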
On 5/15/25 5:08 PM, Bowen Song via user wrote:
The dead node being replaced went back to the DN state, indicating
that the new replacement node failed to join the cluster, usually
because the streaming was interrupted (e.g. by network issues, or long
STW GC pauses). I would start looking for red flags in the logs,
including Cassandra's logs, GC logs, dmesg, the systemd journal, etc.,
on the new node as well as on the other nodes in the cluster. I would
also try `nodetool bootstrap resume` on the replacement node.
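Concretely, something like this is what I have in mind (the gc.log path and format depend on your JVM options, so treat the grep as a sketch):

# on the replacement node, once streaming appears stuck
nodetool bootstrap resume
# quick scan for long stop-the-world pauses
grep -i 'pause' /var/log/cassandra/gc.log | tail -n 50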
On 12/05/2025 09:53, Courtney wrote:
Hello everyone,
I have a cluster with 2 datacenters. I am using
GossipingPropertyFileSnitch as my endpoint snitch, Cassandra version
4.1.8. One datacenter runs entirely on Ubuntu 24.04 with OpenJDK 11,
and the other runs Ubuntu 20.04 with OpenJDK 8. A seed node died in my
second DC, the one running the Ubuntu 20.04 hosts. I ordered a new
dedicated server, updated my seed lists to forget the dead seed node,
and followed the steps to replace a dead node:
JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS
-Dcassandra.replace_address_first_boot=<dead_node_ip>"
Configs between the old and new node are identical apart from the IP
addresses and the line above in the env file to replace the dead node.
I started the node, it began replacing the old one, and it was in the
`UJ` state. Not long into the process, the new node stops processing
data, and the cluster forgets the new node and remembers the old one
in its `DN` state (the old node is turned off, no power). There are no
errors in the logs. I've tried several times, hoping to solve the
issue. I upped my ROOT logging level to DEBUG, and I also set
"org.apache.cassandra.gms.Gossiper TRACE". No errors.
With TRACE set for the Gossiper, I notice that gossiping stops and
data stops streaming at about the same time. I cannot run any nodetool
commands on the new node. The process doesn't die; it keeps open
connections to the nodes that were streaming data to it, but I don't
see any data streaming.
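If it helps, these are the kinds of checks I can still run from the outside (jstack assumes a full JDK on the box, and <cassandra_pid> is the Cassandra process id):

# from one of the nodes that was streaming to the new node
nodetool netstats
# on the new node, a thread dump of the seemingly hung JVM
jstack <cassandra_pid> > /tmp/replacement-node-threads.txt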
I've thought through a lot. Space isn't an issue, and ulimits are set
high in /etc/security/limits.conf; checking /proc/<pid>/limits
confirms the values are high (the exact checks are below). I've
replaced nodes this way before without issue, but this one is causing
me grief. Is there anything more I can do?
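For completeness, these are the checks behind that statement (the pgrep pattern is just one way to find the Cassandra pid, and the data directory is the default):

# free space on the data volume
df -h /var/lib/cassandra
# effective limits of the running Cassandra process
cat /proc/$(pgrep -f CassandraDaemon)/limits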
Courtney