Data is read and written with CL=LOCAL_QUORUM. To isolate the issue to either the C* server or the client, we ran the same SELECT statement 200 times with CL=ALL: it returned the row every time, with zero misses.
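For reference, this is roughly how that check can be driven from the Java driver itself; a minimal sketch assuming java-driver-core 4.x, where the keyspace, table, and item ID are placeholders rather than our real schema:

    import com.datastax.oss.driver.api.core.ConsistencyLevel;
    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class ClAllCheck {
        public static void main(String[] args) {
            // Connects to localhost:9042 by default; point it at the
            // cluster with addContactPoint(...) as needed.
            try (CqlSession session = CqlSession.builder().build()) {
                // Placeholder keyspace/table/key -- substitute the real ones.
                SimpleStatement stmt = SimpleStatement
                        .builder("SELECT * FROM my_ks.items WHERE item_id = ?")
                        .addPositionalValue("ITEM-123")
                        .setConsistencyLevel(ConsistencyLevel.ALL)
                        .build();
                // Re-read the same partition 200 times at CL=ALL; a single
                // miss would mean at least one replica lacks the row.
                int misses = 0;
                for (int i = 0; i < 200; i++) {
                    if (session.execute(stmt).one() == null) {
                        misses++;
                    }
                }
                System.out.printf("misses: %d / 200%n", misses);
            }
        }
    }

A clean 200/200 from this path goes through a driver-picked coordinator just like the service does, which would further implicate the service's driver instance rather than the cluster.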
Retrieving the same ID through the Java service had a 30% failure rate. Isn't this conclusive enough that the data does exist and has three replicas, and that the issue must be isolated to the client? To rule out a stale ring view on the client side, we can also dump the driver's topology metadata and compare it against the server's (sketch at the end of this message).

On Fri, Oct 10, 2025 at 2:47 PM Jeff Jirsa <[email protected]> wrote:

> What consistency level are you using on reads and writes?
>
> If either is less than LOCAL_QUORUM, this behavior is definitely expected.
>
> If you ARE using quorum/local_quorum, and you can correlate these issues
> to when a node scales in / out, then it's probably a consistency violation
> on bootstrap. Unless you run repair before you re-bootstrap a node
> (assuming you're using ephemeral disks), the bootstrap process may choose
> a streaming source that is missing ~1 write, and then you end up with 2
> replicas without the write, so reads don't see it until chance read repair
> or repair is run.
>
> On 2025/10/10 18:30:41 FMH wrote:
> > Thanks for taking the time to respond, Jeff.
> >
> > I was kind of expecting this reply the minute I included version 2.2.8.
> > We are in the process of upgrading to 5.
> >
> > As for the bootstrap automation we use: it has been in effect for more
> > than 10 years, and we have replaced hundreds of nodes without ever
> > having any issues, including this one.
> >
> > We put this automation in place based on the available documentation
> > for Apache and DataStax Cassandra. We have also had it assessed several
> > times over the years by external consultants.
> >
> > Thanks for clarifying getendpoints. This is why we paired it with the
> > SELECT statement validation test as well, to verify the data.
> >
> > On Fri, Oct 10, 2025 at 1:54 PM Jeff Jirsa <[email protected]> wrote:
> >
> > > Also: nodetool getendpoints just hashes the key you provide against
> > > the cluster topology / schema definition, which tells you which nodes
> > > WOULD own the data if it exists. It does NOT guarantee that it exists.
> > >
> > > On 2025/10/10 17:48:32 Jeff Jirsa wrote:
> > > > You're using a 9-year-old release. There have been literally
> > > > hundreds of correctness fixes over those 9 years. You need to
> > > > upgrade.
> > > >
> > > > The rest of your answers inline.
> > > >
> > > > On 2025/10/10 12:56:58 FMH wrote:
> > > > > A few times a week, our developers report that Cassandra reads
> > > > > are coming back with zero rows. No error messages.
> > > > >
> > > > > Using the same item IDs, a CQLSH SELECT statement returns a
> > > > > single row as expected. Furthermore, NODETOOL GETENDPOINTS
> > > > > returns three IPs, as we expect.
> > > > >
> > > > > This confirms these ItemIDs do exist in Cassandra; it is just
> > > > > that the Java clients are not retrieving them.
> > > > >
> > > > > We noticed this issue presents itself more when nodes are
> > > > > replaced in the cluster as a result of EC2 node deprecation.
> > > >
> > > > Are you using EBS or ephemeral disk? Don't use ephemeral disk
> > > > unless you are much better at running Cassandra and know how to
> > > > replace a node without data loss (which you do not seem to know
> > > > how to do).
> > > >
> > > > > Once the developers restarted the Java client apps, they were
> > > > > able to retrieve these ItemIDs.
> > > >
> > > > That sounds weird. It may be that they read-repaired or
> > > > normal-repaired, or it may be that the Java apps were pointing to
> > > > the wrong thing/cluster.
> > > >
> > > > > 1- Is this what is called the 'empty read' behavior?
> > > > > 2- Is this caused by clients' topology metadata getting out of
> > > > > sync with the cluster?
> > > >
> > > > Could be the cluster scaling unsafely due to EC2 events.
> > > > Could be a low consistency level.
> > > > Could be any number of the hundreds of topology bugs fixed since
> > > > 2016.
> > > >
> > > > If it's a client bug, I assume it's an old client bug I've never
> > > > seen before. Well-functioning Cassandra clients shouldn't care
> > > > about the topology; the coordinating server will forward the
> > > > request anyway.
> > > >
> > > > > 3- How can this be detected? Should we have client drivers
> > > > > return 'metadata = cluster.metadata' and compare it to 'nodetool
> > > > > gossipinfo'?
> > > >
> > > > Upgrade your cluster.
> > > > Use EBS so that when nodes change, they don't change data
> > > > ownership.
> > > >
> > > > > 4- Other than restarting the clients, is there a way to force
> > > > > client apps to refresh their ring metadata?
> > > > >
> > > > > The client apps are using the
> > > > > 'com.datastax.oss:java-driver-core:4.13.0' driver.
> > > > >
> > > > > Google returns little information about this, and GenAI chat
> > > > > models, even though useful, tend to hallucinate with confidence.
> > > > >
> > > > > Thanks
> > > > >
> > > > > ----------------------------------------
> > > > > Thank you
> > > >
> > > > --
> >
> > ----------------------------------------
> > Thank you

--
----------------------------------------
Thank you
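P.S. The ring-view dump referenced above, as a minimal sketch against the
4.x metadata API; the session setup and output format are illustrative, not
our production code:

    import java.util.Map;
    import java.util.UUID;

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.metadata.Metadata;
    import com.datastax.oss.driver.api.core.metadata.Node;

    public class RingViewDump {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder().build()) {
                // The driver's current view of the cluster topology;
                // compare host IDs and states against `nodetool status` /
                // `nodetool gossipinfo` on the server side.
                Metadata metadata = session.getMetadata();
                for (Map.Entry<UUID, Node> entry :
                        metadata.getNodes().entrySet()) {
                    Node node = entry.getValue();
                    System.out.printf(
                            "host_id=%s endpoint=%s state=%s distance=%s%n",
                            entry.getKey(), node.getEndPoint(),
                            node.getState(), node.getDistance());
                }
            }
        }
    }

As far as I can tell, the 4.x driver keeps its token map current via
control-connection events and does not expose a public call to force a ring
refresh (CqlSession.refreshSchema() only refreshes schema metadata), so if
this dump disagrees with nodetool, that would point at a driver-side bug
rather than a missing refresh call on our side.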
