No. It's actually almost a guarantee that what's happening is that you're violating consistency on node replacement, and the read at ALL copies the data back to where it needed to be.
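(Side note: if you want the Java service to run the same check your cqlsh test did, the 4.x driver you're on - java-driver-core 4.13, per your earlier mail - lets you pin the consistency level per statement. A minimal sketch; the keyspace/table/column names and the bare connection setup are placeholders, adjust to your schema and contact points:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.Row;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class ReadAtAll {
    public static void main(String[] args) {
        // Uses the driver's default contact point / application.conf.
        try (CqlSession session = CqlSession.builder().build()) {
            // Placeholder keyspace/table/column - substitute your real ones.
            SimpleStatement read = SimpleStatement
                    .newInstance("SELECT * FROM my_ks.items WHERE item_id = ?", "some-item-id")
                    .setConsistencyLevel(DefaultConsistencyLevel.ALL); // compare against LOCAL_QUORUM
            Row row = session.execute(read).one();
            System.out.println(row == null ? "EMPTY read" : "row found");
        }
    }
}

If the same read intermittently comes back empty at LOCAL_QUORUM but always succeeds at ALL, that points at missing replicas, not at driver/topology metadata.)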
Here's what's happening. Let's pretend you have 6 nodes, A-F.

When you write at LQ for a key, let's pretend it goes to B, C, and D. Two of those three have to ack the write - let's say it lands on B and D, but misses C. The write succeeds, and reads will see it, because any quorum read will be either B and D (easy), B and C (data on B, visible, copies to C on read), or C and D (data on D, visible, copies to C on read).

Now imagine that before C gets the data by repair or read repair, B fails and gets sent for maintenance by your cloud provider. A new B' gets added to the ring. B' has to choose a node to stream its data from, and it's going to choose either C or D.

If it chooses D, no problem, reads still always see the data.

If it chooses C, now D still has the data, but B' and C don't. So a read may choose:

B'/C - data missing
B'/D - repairs to B' from D
C/D - repairs to C from D

If you do ALL, it repairs to both B' and C from D.

If you use EBS instead of re-bootstrapping the data, the data volume can get re-attached and B' still has the data. Alternatively, if you had a newer version, you could just run incremental repair often, and this would happen much less often (or never, if you force incremental repair to run before the bootstrap happens). Incremental repair in 2.2 is not good enough - don't try to use it until you upgrade.
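To make the arithmetic of that last state concrete, here's a throwaway sketch (node names match the scenario above; the "only D holds the write" state is the assumption). If the coordinator picked replica pairs uniformly, roughly one in three quorum reads would land on the B'/C pair and come back empty - in the same ballpark as the ~30% miss rate mentioned downthread:

import java.util.List;
import java.util.Set;

public class QuorumToy {
    public static void main(String[] args) {
        // Hypothetical state from the scenario above: B was replaced by B',
        // B' streamed from C, so only D still holds the write.
        Set<String> hasWrite = Set.of("D");

        // The three possible quorum (2 of 3) read pairs out of replicas {B', C, D}.
        List<List<String>> quorumReads = List.of(
                List.of("B'", "C"),
                List.of("B'", "D"),
                List.of("C", "D"));

        for (List<String> pair : quorumReads) {
            boolean visible = pair.stream().anyMatch(hasWrite::contains);
            System.out.println("quorum read " + pair + " -> "
                    + (visible ? "row returned (read repair can fix the stale replica)" : "EMPTY"));
        }
        // A CL=ALL read touches all of B', C and D, so it always sees D's copy,
        // and read repair then writes it back to B' and C.
    }
}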
On 2025/10/10 19:38:08 FMH wrote:
> Data read/written with CL=LQ.
>
> To attempt to isolate the issue to the C* server vs. the client, running a SELECT statement with CL=ALL returned the row 200 times with zero misses.
>
> Retrieving the same ID through the java service had a 30% failure rate.
>
> Isn't this conclusive enough that the data does exist & has three replicas? The issue must be isolated to the client?
>
> On Fri, Oct 10, 2025 at 2:47 PM Jeff Jirsa <[email protected]> wrote:
> > What consistency level are you using on reads and writes?
> >
> > If either is less than LOCAL_QUORUM, this behavior is definitely expected.
> >
> > If you ARE using quorum/local_quorum, and you can correlate these issues to when a node scales in / out, then it's probably a consistency violation on bootstrap. Unless you run repair before you re-bootstrap a node (assuming you're using ephemeral disks), the bootstrap process may choose a streaming source that is missing ~1 write, and then you end up with 2 replicas without the write, so reads don't see it until chance read repair or repair is run.
> >
> > On 2025/10/10 18:30:41 FMH wrote:
> > > Thanks for taking the time to respond, Jeff.
> > >
> > > I was kind of expecting this reply the minute I included version 2.2.8. We are in the process of upgrading to 5.
> > >
> > > As for the bootstrap automation we use: it has been in effect for more than 10 years and we have replaced hundreds of nodes without ever having any issues, including this one.
> > >
> > > We put this automation in place based on the available documentation for Apache and Datastax Cassandra. We have also had it assessed several times over the years by external consultants.
> > >
> > > Thanks for clarifying the getendpoints. This is why we paired it with the SELECT statement validation test as well, to verify the data.
> > >
> > > On Fri, Oct 10, 2025 at 1:54 PM Jeff Jirsa <[email protected]> wrote:
> > > > Also: nodetool getendpoints just hashes the key you provide against the cluster topology / schema definition, which tells you which nodes WOULD own the data if it exists. It does NOT guarantee that it exists.
> > > >
> > > > On 2025/10/10 17:48:32 Jeff Jirsa wrote:
> > > > > You're using a 9 year old release. There have been literally hundreds of correctness fixes over those 9 years. You need to upgrade.
> > > > >
> > > > > The rest of your answers inline.
> > > > >
> > > > > On 2025/10/10 12:56:58 FMH wrote:
> > > > > > A few times a week, our developers report that Cassandra retrieves are coming back with zero rows. No error messages.
> > > > > >
> > > > > > Using the same item ID's, a CQLSH SELECT statement returns a single row as expected. Furthermore, NODETOOL GETENDPOINTS returns three IP's, as we expect.
> > > > > >
> > > > > > This confirms these ItemID's do exist in Cassandra; it is just the Java clients that are not retrieving them.
> > > > > >
> > > > > > We noticed this issue presents itself more when nodes are replaced in the cluster as a result of EC2 node deprecation.
> > > > >
> > > > > Are you using EBS or ephemeral disk? Don't use ephemeral disk unless you are much better at running cassandra and know how to replace a node without data loss (which you do not seem to know how to do).
> > > > >
> > > > > > Once the developers restarted the Java client apps, they were now able to retrieve these ItemID's.
> > > > >
> > > > > That sounds weird. It may be that they read repaired or normal-repaired, or it may be that the java apps were pointing to the wrong thing/cluster.
> > > > >
> > > > > > 1- Is this what is called the 'empty read' behavior?
> > > > > > 2- Is this caused by clients' topology metadata getting out of sync with the cluster?
> > > > >
> > > > > Could be cluster scaling unsafely due to ec2 events.
> > > > > Could be low consistency level.
> > > > > Could be any number of hundreds of topology bugs fixed since 2016.
> > > > >
> > > > > If it's a client bug, I assume it's an old client bug I've never seen before. Well-functioning cassandra clients shouldn't care about the topology; the coordinating server will forward the request anyway.
> > > > >
> > > > > > 3- How can this be detected? Should we have client drivers return 'metadata = cluster.metadata' and compare it to 'nodetool gossipinfo'?
> > > > >
> > > > > Upgrade your cluster.
> > > > > Use EBS so when nodes change, they don't change data ownership.
> > > > >
> > > > > > 4- Other than restarting the clients, is there a way to force the client apps to refresh their ring metadata?
> > > > > >
> > > > > > The client apps are using the 'com.datastax.oss:java-driver-core:4.13.0' driver.
> > > > > >
> > > > > > Google returns little information about this, and GenAI chat models, even though useful, tend to hallucinate with confidence often.
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > ----------------------------------------
> > > > > > Thank you
> > >
> > > --
> > > ----------------------------------------
> > > Thank you
>
> --
> ----------------------------------------
> Thank you
