In my experience, a failed bootstrap / node replacement always leaves some
traces in the logs. At the very least, there will be log entries about
streaming sessions failing or aborting. I have never seen it silently
fail or stop without leaving any trace in the logs, and I can't think of
anything that could cause the process to fail without doing so. BTW, the
relevant log entries can appear hours before the symptom becomes
visible, because a failed streaming session does not cause Cassandra to
immediately abort other active streaming sessions, and the remaining
active sessions can take a while to complete.
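For example, something along these lines (the log paths assume a default
package install, so adjust them for your setup) usually surfaces the
relevant entries, on both the replacement node and the nodes streaming
to it:

    grep -iE 'stream.*(fail|abort|error)' /var/log/cassandra/system.log*
    grep -iE 'stream.*(fail|abort|error)' /var/log/cassandra/debug.log*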
If the process repeatedly fails at the same point, I would suspect some
sort of data corruption or disk error, resulting in data that cannot be
read or deserialised correctly. But this is just a guess, and I could be
wrong.
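If you want to rule that out, a quick sanity check on the nodes
streaming the data could look something like this (the device name is a
placeholder, and smartctl comes from the smartmontools package):

    dmesg -T | grep -iE 'i/o error|ata[0-9]|nvme'
    smartctl -H /dev/sda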
On 16/05/2025 01:14, Courtney wrote:
I checked all the logs and really couldn't find anything. I couldn't
find any sort of errors in dmesg, system.log, debug.log, gc.log (maybe
up the log level?), systemd journal...the logs are totally clean. It
just stops gossiping all of a sudden at 22GB of data each time, and then
the old node returns to DN state. What is `nodetool bootstrap
resume` going to do? Is there a risk to running resume when the
replacement node is no longer in the cluster? Could too high of a
tombstone ratio cause this?
On 5/15/25 5:08 PM, Bowen Song via user wrote:
The dead node being replaced went back to DN state, indicating that the
new replacement node failed to join the cluster, usually because the
streaming was interrupted (e.g. by network issues, or long STW GC
pauses). I would start looking for red flags in the logs, including
Cassandra's logs, GC logs, dmesg, systemd journal, etc., on the new
node, and other nodes in the cluster too. Also, I would try `nodetool
bootstrap resume` on the replacement node.
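For reference, that is run on the replacement node itself, and `nodetool
netstats` afterwards should show whether streaming has actually resumed:

    nodetool bootstrap resume
    nodetool netstats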
On 12/05/2025 09:53, Courtney wrote:
Hello everyone,
I have a cluster with 2 datacenters. I am using
GossipingPropertyFileSnitch as my endpoint snitch. Cassandra version
4.1.8. One datacenter is fully Ubuntu 24.04 with OpenJDK 11, and the
other is Ubuntu 20.04 on OpenJDK 8. A seed node died in my second
DC, the one running Ubuntu 20.04 hosts. I ordered a new dedicated
server and updated my seeds to forget the dead seed node. I did the
steps to replace a dead node:
JVM_OPTS="$JVM_OPTS $JVM_EXTRA_OPTS
-Dcassandra.replace_address_first_boot=<dead_node_ip>"
Configs between the old and new node are identical apart from the IP
addresses and the line above in the env file to replace the dead node. I
started the node, it began replacing the old node, and it was in
the `UJ` state. Not long into the process, the new node stops
processing data, and the cluster forgets the new node and remembers
the old one in its `DN` state (the old node is turned off, no power).
There are no errors in the logs. I've tried several times hoping to
solve the issue. I upped my ROOT logging level to DEBUG, and I also set
"org.apache.cassandra.gms.Gossiper" to TRACE. No errors.
With TRACE set for the Gossiper, I notice gossiping stops and data
stops streaming at about the same time. I cannot run any nodetool
commands on the new node. The process doesn't die, and it leaves open
connections to the nodes that were streaming data, but I don't see any
data streaming.
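(The open connections show up with something like `ss -tnp | grep
:7000`, assuming the default storage port, and a thread dump of the
stuck process via `jstack <pid>` or `jcmd <pid> Thread.print` might show
where it is hanging.)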
I've thought through a lot. Space isn't an issue, ulimits are set
high in /etc/security/limits.conf. Checking /proc/<pid>/limits shows
the values are high. I've replaced nodes like this before without
issue, but this one is causing me grief. Is there anything more I
can do?
Courtney