In your statement below:

    "... If it chooses B, now D still has the data, but B and C don't. So a read may choose: ..."

I think you meant to say "If it chooses C". C is the replica that never got the write in your example, so if the new B' streams from C, then D still has the data but B and C don't.

Thanks
On Sat, Oct 11, 2025 at 12:04 AM Jeff Jirsa <[email protected]> wrote:

> No. It's actually almost a guarantee that what's happening is that you
> violate consistency on node replacement, and the read at ALL copies the
> data back to where it needed to be.
>
> Here's what's happening.
>
> Let's pretend you have 6 nodes, A-F.
>
> When you write at LQ for a key, let's pretend it goes to B, C, and D.
>
> 2 of the three of those have to ack a write - let's say it goes to B and
> D, but misses C. Write succeeds, reads will see it because any read will
> be either B and D (easy), B and C (data on B, visible, copies to C on
> read) or C and D (data on D, visible, copies to C on read).
>
> Now imagine that before C gets the data by repair or read repair, B fails
> and gets sent for maintenance by your cloud provider.
>
> A new B' gets added to the ring. B' has to choose a node to send it data.
> It's going to choose either C or D.
>
> If it chooses D, no problem, reads still always see data.
>
> If it chooses B, now D still has the data, but B and C don't. So a read
> may choose:
>
> B/C - data missing
> B/D - repairs to B from D
> C/D - repairs to C from D
>
> If you do ALL, it repairs to B and C from D.
>
> If you use EBS instead of re-bootstrapping the data, the data drive can
> get re-attached and B still has the data.
>
> Alternatively, if you had a newer version, you could just run incremental
> repair often and this would happen much less often (or never, if you
> force incremental repair to run before the bootstrap happens).
> Incremental repair in 2.2 is not good enough - don't try to use it until
> you upgrade.
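For anyone else debugging the same symptoms: the CL=ALL check mentioned further down the thread can be scripted from the application side. Below is only a rough sketch with java-driver-core 4.x (the keyspace, table, and column names are placeholders, not the real schema); it issues the same read at LOCAL_QUORUM and at ALL and reports any iteration where either comes back empty.

    import com.datastax.oss.driver.api.core.ConsistencyLevel;
    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.Row;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class ConsistencyCompare {
        public static void main(String[] args) {
            String itemId = args[0];
            // Contact points, local DC and auth come from application.conf.
            try (CqlSession session = CqlSession.builder().build()) {
                // Placeholder keyspace/table/column - substitute the real schema.
                String cql = "SELECT item_id FROM my_ks.items WHERE item_id = ?";
                for (int i = 0; i < 200; i++) {
                    Row lq = session.execute(
                            SimpleStatement.newInstance(cql, itemId)
                                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM)).one();
                    Row all = session.execute(
                            SimpleStatement.newInstance(cql, itemId)
                                    .setConsistencyLevel(ConsistencyLevel.ALL)).one();
                    if (lq == null || all == null) {
                        System.out.printf("iteration %d: LOCAL_QUORUM found=%b, ALL found=%b%n",
                                i, lq != null, all != null);
                    }
                }
            }
        }
    }

If LOCAL_QUORUM intermittently misses while ALL always finds the row, that matches the "only one replica has the write" scenario Jeff describes above; note that the read at ALL will itself repair the missing replicas, so mismatches tend to disappear after the first pass. If both always find it here but the service still misses, the client is the more likely suspect.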
> On 2025/10/10 19:38:08 FMH wrote:
> > Data read/written with CL=LQ.
> >
> > To attempt to isolate the issue on the C* server vs. the client, running
> > a SELECT statement with CL=ALL returned the row 200 times with zero
> > misses.
> >
> > Retrieving the same ID through the Java service had a 30% failure rate.
> >
> > Isn't this conclusive enough that the data does exist and has three
> > replicas, and that the issue must be isolated to the client?
> >
> > On Fri, Oct 10, 2025 at 2:47 PM Jeff Jirsa <[email protected]> wrote:
> > > What consistency level are you using on reads and writes?
> > >
> > > If either are less than LOCAL_QUORUM, this behavior is definitely
> > > expected.
> > >
> > > If you ARE using quorum/local_quorum, and you can correlate these
> > > issues to when a node scales in / out, then it's probably consistency
> > > violation on bootstrap. Unless you run repair before you re-bootstrap
> > > a node (assuming you're using ephemeral disks), the bootstrap process
> > > may choose a streaming source that is missing ~1 write, and then you
> > > end up with 2 replicas without the write, so reads don't see it until
> > > chance read repair or repair is run.
> > >
> > > On 2025/10/10 18:30:41 FMH wrote:
> > > > Thanks for taking the time to respond, Jeff.
> > > >
> > > > I was kind of expecting this reply the minute I included version
> > > > 2.2.8. We are in the process of upgrading to 5.
> > > >
> > > > As for the bootstrap automation we use: it has been in effect for
> > > > more than 10 years and we have replaced 100's of nodes without ever
> > > > having any issues, including this one.
> > > >
> > > > This automation was put in place based on the available documentation
> > > > for Apache and Datastax Cassandra. We have also had it assessed
> > > > several times over the years by external consultants.
> > > >
> > > > Thanks for clarifying the getendpoints. This is why we paired it with
> > > > the SELECT statement validation test as well to verify data.
> > > >
> > > > On Fri, Oct 10, 2025 at 1:54 PM Jeff Jirsa <[email protected]> wrote:
> > > > > Also: nodetool getendpoints just hashes the key you provide against
> > > > > the cluster topology / schema definition, which tells you which
> > > > > nodes WOULD own the data if it exists. It does NOT guarantee that
> > > > > it exists.
> > > > >
> > > > > On 2025/10/10 17:48:32 Jeff Jirsa wrote:
> > > > > > You're using a 9 year old release. There have been literally
> > > > > > hundreds of correctness fixes over those 9 years. You need to
> > > > > > upgrade.
> > > > > >
> > > > > > The rest of your answers inline.
> > > > > >
> > > > > > On 2025/10/10 12:56:58 FMH wrote:
> > > > > > > A few times a week, our developers report that Cassandra reads
> > > > > > > are coming back with zero rows. No error messages.
> > > > > > >
> > > > > > > Using the same item ID's, a CQLSH SELECT statement returns a
> > > > > > > single row as expected. Furthermore, NODETOOL GETENDPOINTS
> > > > > > > returns three IP's as we expect.
> > > > > > >
> > > > > > > This confirms these ItemID's do exist in Cassandra; it is just
> > > > > > > that the Java clients are not retrieving them.
> > > > > > >
> > > > > > > We noticed this issue presents itself more when nodes are
> > > > > > > replaced in the cluster as a result of EC2 node deprecation.
> > > > > >
> > > > > > Are you using EBS or ephemeral disk? Don't use ephemeral disk
> > > > > > unless you are much better at running cassandra and know how to
> > > > > > replace a node without data loss (which you do not seem to know
> > > > > > how to do).
> > > > > >
> > > > > > > Once the developers restarted the Java client apps, they were
> > > > > > > able to retrieve these ItemID's.
> > > > > >
> > > > > > That sounds weird. It may be that they read repaired or
> > > > > > normal-repaired, or it may be that the java apps were pointing to
> > > > > > the wrong thing/cluster.
> > > > > >
> > > > > > > 1- Is this what is called the 'empty read' behavior?
> > > > > > > 2- Is this caused by the clients' topology metadata getting out
> > > > > > > of sync with the cluster?
> > > > > >
> > > > > > Could be cluster scaling unsafely due to ec2 events.
> > > > > > Could be low consistency level.
> > > > > > Could be any number of hundreds of topology bugs fixed since 2016.
> > > > > >
> > > > > > If it's a client bug, I assume it's an old client bug I've never
> > > > > > seen before. Well-functioning cassandra clients shouldn't care
> > > > > > about the topology, the coordinating server will forward the
> > > > > > request anyway.
> > > > > >
> > > > > > > 3- How can this be detected? Should we have client drivers
> > > > > > > return 'metadata = cluster.metadata' and compare it to
> > > > > > > 'nodetool gossipinfo'?
> > > > > >
> > > > > > Upgrade your cluster.
> > > > > > Use EBS so when nodes change, they don't change data ownership.
> > > > > >
> > > > > > > 4- Other than restarting the clients, is there a way to have
> > > > > > > client apps force a refresh of their ring metadata?
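A quick note on questions 2-4, since they concern the driver rather than the server: java-driver 4.x keeps its own copy of the topology, and that view can at least be dumped and compared against nodetool status / gossipinfo the next time the problem occurs. This is a minimal sketch assuming a default-configured session, not code from the affected service:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.metadata.Node;

    public class DriverTopologyDump {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // The driver's current view of the ring.
                for (Node node : session.getMetadata().getNodes().values()) {
                    System.out.printf("%s dc=%s rack=%s state=%s%n",
                            node.getEndPoint(),
                            node.getDatacenter(),
                            node.getRack(),
                            node.getState());
                }
            }
        }
    }

If a node the server side considers normal is missing or marked down here, that would support the stale-client-metadata theory; if the two views match, the missing-replica explanation above is the better fit.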
> > > > > > > The client apps are using the
> > > > > > > 'com.datastax.oss:java-driver-core:4.13.0' driver.
> > > > > > >
> > > > > > > Google returns little information about this, and GenAI chat
> > > > > > > models, even though useful, tend to hallucinate with confidence.
> > > > > > >
> > > > > > > Thanks

--
----------------------------------------
Thank you
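One last note for anyone who finds this thread with the same symptoms: Jeff's first question (what consistency level reads and writes use) is worth ruling out explicitly, because java-driver 4.x defaults to LOCAL_ONE unless the application overrides it. A minimal sketch of pinning the default programmatically (the same setting can also live in application.conf as basic.request.consistency):

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
    import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

    public class SessionFactory {
        public static CqlSession build() {
            // Make LOCAL_QUORUM the default for every request instead of the
            // driver default (LOCAL_ONE), so no code path silently reads at ONE.
            DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                    .withString(DefaultDriverOption.REQUEST_CONSISTENCY, "LOCAL_QUORUM")
                    .build();
            return CqlSession.builder()
                    .withConfigLoader(loader)
                    .build();
        }
    }

This only rules out the "reads below LOCAL_QUORUM" explanation; it does not replace the advice above about using EBS and running repair before re-bootstrapping a replacement node.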
