In your statement ...
If it chooses B, now D still has the data, but B and C don’t. So a read may
choose:

I think you meant to say
If it chooses C, now D still has the data, but B and C don’t. So a read may
choose:

Thanks

On Sat, Oct 11, 2025 at 12:04 AM Jeff Jirsa <[email protected]> wrote:

> No. It’s actually almost a guarantee that what’s happening is that you
> violate consistency on node replacement and the read at ALL copies the data
> back to where it needed to be
>
> Here’s what’s happening
>
> Let’s pretend you have 6 nodes, A-F
>
> When you write at LQ for a key, let’s pretend it goes to B, C, and D
>
> Two of those three have to ack the write - let’s say it goes to B and
> D, but misses C. Write succeeds, reads will see it because any read will be
> either B and D (easy), B and C (data on B, visible, copies to C on read) or
> C and D (data on D, visible, copies to C on read)
>
> Now imagine that before C gets the data by repair or read repair, B fails
> and gets sent for maintenance by your cloud provider
>
> A new B’ gets added to the ring. B’ has to choose a node to send it data.
> It’s going to choose either C or D.
>
> If it chooses D, no problem, reads still always see data
>
> If it chooses B, now D still has the data, but B and C don’t. So a read
> may choose:
>
> B/C - data missing
> B/D - repairs to B from D
> C/D - repairs to C from D
>
> If you do ALL, it repairs to B and C from D
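>
> For reference, a rough sketch of what forcing that ALL read looks like from
> the 4.x Java driver (the keyspace, table, and key below are placeholders for
> whatever your service actually queries):
>
>     import com.datastax.oss.driver.api.core.CqlSession;
>     import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
>     import com.datastax.oss.driver.api.core.cql.Row;
>     import com.datastax.oss.driver.api.core.cql.SimpleStatement;
>
>     public class ReadAtAll {
>         public static void main(String[] args) {
>             String itemId = "some-item-id"; // placeholder key
>             try (CqlSession session = CqlSession.builder().build()) {
>                 // Same SELECT the service runs, but at ALL: every replica has
>                 // to answer, and the stale ones get the row written back as
>                 // part of the read (B and C from D in the example above).
>                 SimpleStatement stmt = SimpleStatement
>                     .newInstance("SELECT * FROM my_ks.items WHERE id = ?", itemId)
>                     .setConsistencyLevel(DefaultConsistencyLevel.ALL);
>                 Row row = session.execute(stmt).one();
>                 System.out.println(row != null ? "row found" : "row missing");
>             }
>         }
>     }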
>
> If you use EBS instead of re-bootstrapping the data, the data drive can
> get re-attached and B still has the data.
>
> Alternatively, if you had a newer version, you could just run incremental
> repair often, and this would happen much less often (or never, if you force
> incremental repair to run before the bootstrap happens). Incremental repair
> in 2.2 is not good enough - don't try to use it until you upgrade.
>
>
>
>
> On 2025/10/10 19:38:08 FMH wrote:
> > Data read/written with CL=LQ.
> >
> > To attempt to isolate the issue to the C* server vs. the client, we ran a
> > SELECT statement with CL=ALL; it returned the row 200 times with zero misses.
> >
> > Retrieving the same ID through the Java service had a 30% failure rate.
> >
> > Isn't this conclusive enough that the data does exist and has three
> > replicas? The issue must then be isolated to the client.
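> >
> > (For reference, the same probe can be issued through the Java driver so it
> > exercises the same path as the service; a rough sketch, with placeholder
> > keyspace/table/key names:)
> >
> >     import com.datastax.oss.driver.api.core.CqlSession;
> >     import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
> >     import com.datastax.oss.driver.api.core.cql.SimpleStatement;
> >
> >     public class QuorumProbe {
> >         public static void main(String[] args) {
> >             // Same read the service issues, at LOCAL_QUORUM like the app.
> >             SimpleStatement stmt = SimpleStatement
> >                 .newInstance("SELECT * FROM my_ks.items WHERE id = ?",
> >                              "some-item-id") // placeholder key
> >                 .setConsistencyLevel(DefaultConsistencyLevel.LOCAL_QUORUM);
> >             int misses = 0;
> >             try (CqlSession session = CqlSession.builder().build()) {
> >                 for (int i = 0; i < 200; i++) {
> >                     if (session.execute(stmt).one() == null) {
> >                         misses++;
> >                     }
> >                 }
> >             }
> >             System.out.println("misses: " + misses + "/200");
> >         }
> >     }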
> >
> > On Fri, Oct 10, 2025 at 2:47 PM Jeff Jirsa <[email protected]> wrote:
> >
> > > What consistency level are you using on reads and writes?
> > >
> > > If either is less than LOCAL_QUORUM, this behavior is definitely expected.
> > >
> > > If you ARE using quorum/local_quorum, and you can correlate these issues
> > > to when a node scales in / out, then it's probably a consistency violation
> > > on bootstrap. Unless you run repair before you re-bootstrap a node
> > > (assuming you're using ephemeral disks), the bootstrap process may choose
> > > a streaming source that is missing ~1 write, and then you end up with 2
> > > replicas without the write, so reads don't see it until a chance read
> > > repair or repair is run.
> > >
> > >
> > >
> > > On 2025/10/10 18:30:41 FMH wrote:
> > > > Thanks for taking the time to respond, Jeff.
> > > >
> > > > I was kind of expecting this reply the minute I included version 2.2.8.
> > > > We are in the process of upgrading to 5.
> > > >
> > > > As for the bootstrap automation we use: it has been in effect for more
> > > > than 10 years, and we have replaced hundreds of nodes without ever
> > > > having any issues, including this one.
> > > >
> > > > We put this automation in place based on the available documentation
> > > > for Apache and DataStax Cassandra. We have also had it assessed several
> > > > times over the years by external consultants.
> > > >
> > > > Thanks for clarifying getendpoints. This is why we paired it with the
> > > > SELECT statement validation test as well, to verify the data.
> > > >
> > > > On Fri, Oct 10, 2025 at 1:54 PM Jeff Jirsa <[email protected]> wrote:
> > > >
> > > > > Also: nodetool getendpoints just hashes the key you provide against
> > > > > the cluster topology / schema definition, which tells you which nodes
> > > > > WOULD own the data if it exists. It does NOT guarantee that it exists.
> > > > >
> > > > >
> > > > >
> > > > > On 2025/10/10 17:48:32 Jeff Jirsa wrote:
> > > > > > You're using a 9 year old release. There have been literally
> > > > > > hundreds of correctness fixes over those 9 years. You need to upgrade.
> > > > > >
> > > > > > The rest of your answers inline.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 2025/10/10 12:56:58 FMH wrote:
> > > > > > > A few times a week, our developers report that Cassandra
> > > > > > > retrievals are coming back with zero rows. No error messages.
> > > > > > >
> > > > > > > Using the same item IDs, a CQLSH SELECT statement returns a single
> > > > > > > row as expected. Furthermore, NODETOOL GETENDPOINTS returns three
> > > > > > > IPs as we expect.
> > > > > > >
> > > > > > > This confirms these ItemIDs do exist in Cassandra; it is just that
> > > > > > > the Java clients are not retrieving them.
> > > > > > >
> > > > > > > We noticed this issue presents itself more when nodes are replaced
> > > > > > > in the cluster as a result of EC2 node deprecation.
> > > > > >
> > > > > > Are you using EBS or ephemeral disk? Don't use ephemeral disk unless
> > > > > > you are much better at running Cassandra and know how to replace a
> > > > > > node without data loss (which you do not seem to know how to do).
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Once the developers restarted the Java client apps, they were now
> > > > > > > able to retrieve these ItemIDs.
> > > > > >
> > > > > > That sounds weird. It may be that they got read-repaired or
> > > > > > normal-repaired, or it may be that the Java apps were pointing to
> > > > > > the wrong thing/cluster.
> > > > > >
> > > > > > >
> > > > > > > 1- Is this what is called the 'empty read' behavior?
> > > > > > > 2- Is this caused by clients' topology metadata getting out of
> > > > > > > sync with the cluster?
> > > > > >
> > > > > > Could be cluster scaling unsafely due to EC2 events.
> > > > > > Could be a low consistency level.
> > > > > > Could be any number of hundreds of topology bugs fixed since 2016.
> > > > > >
> > > > > > If it's a client bug, I assume it's an old client bug I've never
> > > > > > seen before. Well-functioning Cassandra clients shouldn't care about
> > > > > > the topology; the coordinating server will forward the request anyway.
> > > > > >
> > > > > > > 3- How can this be detected? Should we have client drivers return
> > > > > > > 'metadata = cluster.metadata' and compare it to 'nodetool gossipinfo'?
> > > > > >
> > > > > > Upgrade your cluster.
> > > > > > Use EBS so when nodes change, they don't change data ownership.
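> > > > > >
> > > > > > If you do want to compare the client's view of the ring with nodetool
> > > > > > gossipinfo (your question 3), the 4.x driver's metadata API is enough.
> > > > > > A rough sketch ('session' being whatever CqlSession your app already
> > > > > > holds):
> > > > > >
> > > > > >     // Dump the nodes the driver currently knows about and the state
> > > > > >     // it thinks they're in; diff this against nodetool gossipinfo.
> > > > > >     session.getMetadata().getNodes().values().forEach(node ->
> > > > > >         System.out.println(node.getEndPoint() + " " + node.getState()));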
> > > > > >
> > > > > > > 4- Other than restarting the clients, is there a way to have
> > > > > > > client apps force a refresh of their ring metadata?
> > > > > > >
> > > > > > > The client apps are using the
> > > > > > > 'com.datastax.oss:java-driver-core:4.13.0' driver.
> > > > > > >
> > > > > > > Google returns little information about this, and GenAI chat
> > > > > > > models, even though useful, tend to hallucinate with confidence
> > > > > > > often.
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > ----------------------------------------
> > > > > > > Thank you
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
> > > > ----------------------------------------
> > > > Thank you
> > > >
> > >
> >
> >
> > --
> >
> > ----------------------------------------
> > Thank you
> >
>


-- 

----------------------------------------
Thank you
