Re: Cassandra Clients: Stale Cluster Topology

Jeff Jirsa Sat, 11 Oct 2025 11:48:51 -0700

Correct yes, apologies, I was typing this on my phone while watching TV.

If the new, joining node chooses a node in the ring that missed a write, then 2 
nodes are missing the write, and reads at quorum/local_quorum do not see the 
data.
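
A minimal sketch of that scenario, assuming RF=3 and 2-of-3 quorum reads. It is a toy model and not Cassandra's read path; the function name `quorum_reads` and the variable names are hypothetical. It enumerates every read quorum and reports whether at least one replica in it holds the write.

```python
from itertools import combinations

def quorum_reads(replicas):
    """Map each 2-of-3 read quorum to True if any replica in it holds the write."""
    return {
        "/".join(pair): any(replicas[name] for name in pair)
        for pair in combinations(sorted(replicas), 2)
    }

# True means the replica holds the write: acked by B and D, missed by C.
before = {"B": True, "C": False, "D": True}

# B is replaced by B'. In this model B' inherits the state of its streaming source.
after_stream_from_d = {"B'": before["D"], "C": before["C"], "D": before["D"]}
after_stream_from_c = {"B'": before["C"], "C": before["C"], "D": before["D"]}

for label, state in [
    ("before replacement", before),
    ("B' streamed from D", after_stream_from_d),
    ("B' streamed from C", after_stream_from_c),
]:
    print(label, quorum_reads(state))
```

Only the last case produces a quorum (B'/C) in which neither replica has the write, which is the read that comes back empty.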


FWIW, the "data" they see may also be a tombstone / deletion marker, which 
means you could ALSO "resurrect" deleted data with this type of operation, just 
in case that isn't obvious. "Empty" reads are at least easy to detect, 
"partial" or "resurrected" reads are harder to detect. 
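
The tombstone case can be sketched the same way, again as a toy model under stated assumptions: one cell per replica, last-write-wins by timestamp, and a hypothetical `TOMBSTONE` marker. The insert at ts=1 reached every replica, the delete at ts=2 was acked by B and D but missed C, and the replacement B' then streamed from C.

```python
from itertools import combinations

TOMBSTONE = "<tombstone>"  # hypothetical stand-in for a deletion marker

def quorum_read(replicas, pair):
    """Merge the cells held by `pair` last-write-wins and return the visible value."""
    _, value = max((replicas[name] for name in pair), key=lambda cell: cell[0])
    return None if value == TOMBSTONE else value

# Each replica holds one (timestamp, value) cell for the key.
# Only D still has the delete; B' and C hold the older, live cell.
replicas = {"B'": (1, "row"), "C": (1, "row"), "D": (2, TOMBSTONE)}

for pair in combinations(sorted(replicas), 2):
    print("/".join(pair), "->", quorum_read(replicas, pair))
```

A read served by B'/C returns the deleted row, while any quorum that includes D still sees the deletion.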



On 2025/10/11 13:36:08 dbms-tech wrote:
> In your statement ...
> If it chooses B, now D still has the data, but B and C don’t. So a read may
> choose:
> 
> I think, you meant to say
> If it chooses C, now D still has the data, but B and C don’t. So a read may
> choose:
> 
> Thanks
> 
> On Sat, Oct 11, 2025 at 12:04 AM Jeff Jirsa <[email protected]> wrote:
> 
> > No. It’s actually almost a guarantee that what’s happening is that you
> > violate consistency on node replacement and the read at ALL copies the data
> > back to where it needed to be
> >
> > Here’s what’s happening
> >
> > Let’s pretend you have 6 nodes, A-F
> >
> > When you write at LQ for a key, let’s pretend it goes to B, C, and D
> >
> > 2 of the three of those have to ack a write - let’s say it goes to B and
> > D, but misses C. Write succeeds, reads will see it because any read will be
> > either B and D (easy), B and C (data on B, visible, copies to C on read) or
> > C and D (data on D, visible, copies to C on read)
> >
> > Now imagine that before C gets the data by repair or read repair B fails
> > and gets sent for maintenance by your cloud provider
> >
> > > A new B’ gets added to the ring. B’ has to choose who will send it data.
> > > It’s going to choose either C or D.
> >
> > If it chooses D, no problem, reads still always see data
> >
> > If it chooses B, now D still has the data, but B and C don’t. So a read
> > may choose:
> >
> > B/C - data missing
> > B/D - repairs to B from D
> > C/D - repairs to C from D
> >
> > If you do ALL, it repairs to B and C from D
> >
> > If you use EBS instead of re-bootstrapping the data, the data drive can
> > get re-attached and B still has the data.
> >
> > Alternatively, if you had a newer version, you could just run incremental
> > > repair often and this would happen much less often (or never, if you force
> > incremental repair to run before the bootstrap happens). Incremental repair
> > in 2.2 is not good enough - don’t try to use it until you upgrade
> >
> >
> >
> >
> > On 2025/10/10 19:38:08 FMH wrote:
> > > Data read/written with CL=LQ.
> > >
> > > To attempt to isolate the issue on the C* server vs. client, running a
> > > SELECT statement with CL=ALL returned the row 200 times with zero misses.
> > >
> > > Retrieving the same ID through the java service had a 30% failure rate.
> > >
> > > > Isn't this conclusive enough that the data does exist & has three replicas?
> > > > The issue must be isolated to the client?
> > >
> > > On Fri, Oct 10, 2025 at 2:47 PM Jeff Jirsa <[email protected]> wrote:
> > >
> > > > What consistency level are you using on reads and writes?
> > > >
> > > > > If either is less than LOCAL_QUORUM, this behavior is definitely expected.
> > > > >
> > > > > If you ARE using quorum/local_quorum, and you can correlate these issues
> > > > > to when a node scales in / out, then it's probably a consistency violation
> > > > > on bootstrap. Unless you run repair before you re-bootstrap a node
> > > > > (assuming you're using ephemeral disks), the bootstrap process may choose
> > > > > a streaming source that is missing ~1 write, and then you end up with 2
> > > > > replicas without the write, so reads don't see it until chance read repair
> > > > > or repair is run.
> > > >
> > > >
> > > >
> > > > On 2025/10/10 18:30:41 FMH wrote:
> > > > > > Thanks for taking the time to respond, Jeff.
> > > > > >
> > > > > > I was kind of expecting this reply the minute I included version 2.2.8.
> > > > > > We are in the process of upgrading to 5.
> > > > > >
> > > > > > As for the bootstrap automation we use: it has been in effect for more
> > > > > > than 10 years and we have replaced hundreds of nodes without ever having
> > > > > > any issues, including this one.
> > > > > >
> > > > > > We put this automation in place based on the available documentation for
> > > > > > Apache and DataStax Cassandra. We have also had it assessed several times
> > > > > > over the years by external consultants.
> > > > > >
> > > > > > Thanks for clarifying the getendpoints. This is why we paired it with the
> > > > > > SELECT statement validation test as well to verify data.
> > > > > >
> > > > > > On Fri, Oct 10, 2025 at 1:54 PM Jeff Jirsa <[email protected]> wrote:
> > > > >
> > > > > > > Also: nodetool getendpoints just hashes the key you provide against the
> > > > > > > cluster topology / schema definition, which tells you which nodes WOULD
> > > > > > > own the data if it exists. It does NOT guarantee that it exists.
> > > > > >
> > > > > >
> > > > > >
> > > > > > On 2025/10/10 17:48:32 Jeff Jirsa wrote:
> > > > > > > > You're using a 9 year old release. There have been literally hundreds
> > > > > > > > of correctness fixes over those 9 years. You need to upgrade.
> > > > > > >
> > > > > > > The rest of your answers inline.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On 2025/10/10 12:56:58 FMH wrote:
> > > > > > > > > A few times a week, our developers report that Cassandra retrieves are
> > > > > > > > > coming back with zero rows. No error messages.
> > > > > > > > >
> > > > > > > > > Using the same item IDs, a CQLSH SELECT statement returns a single row
> > > > > > > > > as expected. Furthermore, NODETOOL GETENDPOINTS returns three IPs as we
> > > > > > > > > expect.
> > > > > > > > >
> > > > > > > > > This confirms these ItemIDs do exist in Cassandra; it is just that the
> > > > > > > > > Java clients are not retrieving them.
> > > > > > > > >
> > > > > > > > > We noticed this issue presents itself more when nodes are replaced in
> > > > > > > > > the cluster as a result of EC2 node deprecation.
> > > > > > >
> > > > > > > > Are you using EBS or ephemeral disk? Don't use ephemeral disk unless you
> > > > > > > > are much better at running cassandra and know how to replace a node
> > > > > > > > without data loss (which you do not seem to know how to do).
> > > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > > > Once the developers restarted the Java client apps, they were now able
> > > > > > > > > to retrieve these ItemIDs.
> > > > > > >
> > > > > > > > That sounds weird. It may be that they read repaired or normal-repaired,
> > > > > > > > or it may be that the java apps were pointing to the wrong thing/cluster.
> > > > > > >
> > > > > > > >
> > > > > > > > > 1- Is this what is called the 'empty read' behavior?
> > > > > > > > > 2- Is this caused by the clients' topology metadata getting out of sync
> > > > > > > > > with the cluster?
> > > > > > >
> > > > > > > > Could be cluster scaling unsafely due to ec2 events.
> > > > > > > > Could be low consistency level.
> > > > > > > > Could be any number of hundreds of topology bugs fixed since 2016.
> > > > > > > >
> > > > > > > > If it's a client bug, I assume it's an old client bug I've never seen
> > > > > > > > before. Well functioning cassandra clients shouldn't care about the
> > > > > > > > topology, the coordinating server will forward the request anyway.
> > > > > > >
> > > > > > > > > 3- How can this be detected? Should we have client drivers return
> > > > > > > > > 'metadata = cluster.metadata' and compare it to 'nodetool gossipinfo'?
> > > > > > >
> > > > > > > Upgrade your cluster.
> > > > > > > Use EBS so when nodes change, they don't change data ownership.
> > > > > > >
> > > > > > > > > 4- Other than restarting the clients, is there a way to force client
> > > > > > > > > apps to refresh their ring metadata?
> > > > > > > > >
> > > > > > > > > The client apps are using the
> > > > > > > > > 'com.datastax.oss:java-driver-core:4.13.0' driver.
> > > > > > > > >
> > > > > > > > > Google returns little information about this, and GenAI chat models,
> > > > > > > > > even though useful, tend to hallucinate with confidence often.
> > > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > > > ----------------------------------------
> > > > > > > > Thank you
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > ----------------------------------------
> > > > > Thank you
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > ----------------------------------------
> > > Thank you
> > >
> >
> 
> 
> -- 
> 
> ----------------------------------------
> Thank you
>
