What consistency level are you using on reads and writes? If either is lower than LOCAL_QUORUM, this behavior is expected.
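As a rough illustration of why the consistency level matters, here is a minimal sketch of the usual overlap rule: a read is only guaranteed to see an acknowledged write when the replicas contacted on read plus the replicas contacted on write exceed the replication factor. It assumes a single datacenter with RF=3 (the three IPs from getendpoints suggest that, but it is an assumption), and the function names are made up for the example, not part of any driver.

```python
def replicas_required(consistency_level: str, rf: int) -> int:
    """Replicas that must respond for one operation (single-DC assumption)."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "QUORUM": rf // 2 + 1,
        "LOCAL_QUORUM": rf // 2 + 1,  # same as QUORUM when there is one DC
        "ALL": rf,
    }
    return levels[consistency_level]


def read_sees_write(write_cl: str, read_cl: str, rf: int = 3) -> bool:
    """True when the read and write replica sets must share at least one node."""
    return replicas_required(write_cl, rf) + replicas_required(read_cl, rf) > rf


if __name__ == "__main__":
    for w, r in [("LOCAL_QUORUM", "LOCAL_QUORUM"), ("ONE", "LOCAL_QUORUM"), ("ONE", "ONE")]:
        print(f"write={w:13s} read={r:13s} overlap guaranteed: {read_sees_write(w, r)}")
```

With RF=3, LOCAL_QUORUM on both sides contacts 2 + 2 replicas, so at least one replica is in both sets. Drop either side to ONE and a read can land entirely on replicas that have not received the write yet.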
If you ARE using quorum/local_quorum, and you can correlate these issues with a node scaling in or out, then it's probably a consistency violation on bootstrap. Unless you run repair before you re-bootstrap a node (assuming you're using ephemeral disks), the bootstrap process may choose a streaming source that is missing a write. You then end up with 2 replicas without the write, so reads don't see it until a chance read repair happens or repair is run.

On 2025/10/10 18:30:41 FMH wrote:
> Thanks for taking the time to Respond, Jeff.
>
> I was kind expecting this reply the minute I included version 2.2.8. We are
> in the process of upgrading to 5.
>
> As for the bootstrap automation we use. It has been in effect for more than
> 10 years and we replaced 100's of nodes without ever having any issues
> including this one.
>
> This automation we put in place based on the available documentation for
> Apache and Datastax cassandra. We have also had it assessed several times
> over the years by external consultants.
>
> Thanks for clarifying the getendpoints. This is why we paired it with the
> SELECT statement validation test as well to verify data.
>
> On Fri, Oct 10, 2025 at 1:54 PM Jeff Jirsa <[email protected]> wrote:
>
> > Also: nodetool getendpoints just hashes the key you provide against the
> > cluster topology / schema definition, which tells you which nodes WOULD
> > own the data if it exists. It does NOT guarantee that it exists.
> >
> > On 2025/10/10 17:48:32 Jeff Jirsa wrote:
> > > You're using a 9 year old release. There have been literally hundreds
> > > of correctness fixes over those 9 years. You need to upgrade.
> > >
> > > The rest of your answers inline.
> > >
> > > On 2025/10/10 12:56:58 FMH wrote:
> > > > Few times a week, our developers report that Cassandra retrieves are
> > > > coming back with zero rows. No error messages.
> > > >
> > > > Using the same item ID's, a CQLSH SELECT statement returns a single
> > > > row as expected. Furthermore, the NODETOOL GETENDPOINTS returns three
> > > > IP's as we expect.
> > > >
> > > > This confirms these ItemID's do exist in Cassandra, it is just the
> > > > Java clients are not retrieving it.
> > > >
> > > > We noticed this issue to present itself more when nodes are replaced
> > > > in the cluster as a result of EC2 node deprecation.
> > >
> > > Are you using EBS or ephemeral disk? Don't use ephemeral disk unless
> > > you are much better at running cassandra and know how to replace a
> > > node without data loss (which you do not seem to know how to do).
> > >
> > > > Once the developers restarted the Java client apps, it was now able
> > > > to retrieve these ItemID's.
> > >
> > > That sounds weird. It may be that they read repaired or
> > > normal-repaired, or it may be that the java apps were pointing to the
> > > wrong thing/cluster.
> > >
> > > > 1- Is this what is called the 'empty' read' behavior?
> > > > 2- Is this caused by clients topology metadata getting out of sync
> > > > with the cluster?
> > >
> > > Could be cluster scaling unsafely due to ec2 events.
> > > Could be low consistency level
> > > Could be any number of hundreds of topology bugs fixed since 2016.
> > >
> > > If it's a client bug, I assume it's an old client bug I've never seen
> > > before. Well functioning cassandra clients shouldn't care about the
> > > topology, the coordinating server will forward the request anyway.
> > >
> > > > 3- How can this be detected? Should we have client drivers return
> > > > 'metadata = cluster.metadata' and compare it to 'nodetool
> > > > gossipinfo'?
> > >
> > > Upgrade your cluster.
> > > Use EBS so when nodes change, they don't change data ownership.
> > >
> > > > 4- Other than restarting the clients, is there a way to have client
> > > > apps to force to refresh their ring metadata?
> > > >
> > > > The client apps are using 'com.datastax.oss:java-driver-core:4.13.0'
> > > > driver.
> > > >
> > > > Google returns little information about this and GenAI's chat model
> > > > even though useful, they tend to hallucinate with confidence often.
> > > >
> > > > Thanks
> > > >
> > > > ----------------------------------------
> > > > Thank you
> >
> > --
> > ----------------------------------------
> Thank you
>
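To make the bootstrap scenario from the top of this reply concrete, here is a toy simulation. It is a sketch under stated assumptions, not how Cassandra is implemented: three replicas named A, B and C (hypothetical names), RF=3, one key, no timestamps, no hints and no read repair. A quorum write lands on A and B, C misses it, and then B is replaced on an empty disk and streams from C.

```python
from itertools import combinations


def quorum_read(replicas, contacted, key):
    """Return the value if any contacted replica holds the key, else None."""
    for name in contacted:
        if key in replicas[name]:
            return replicas[name][key]
    return None


def visible_to_every_quorum(replicas, key):
    """True if every possible 2-of-3 replica pair returns the row."""
    return all(
        quorum_read(replicas, pair, key) is not None
        for pair in combinations(sorted(replicas), 2)
    )


def simulate():
    replicas = {"A": {}, "B": {}, "C": {}}

    # QUORUM write: acknowledged by A and B, C never received it.
    for name in ("A", "B"):
        replicas[name]["item-1"] = "row"
    before = visible_to_every_quorum(replicas, "item-1")

    # B is replaced on an empty disk and happens to stream from C,
    # which is the one replica that is missing the write.
    replicas["B"] = dict(replicas["C"])
    after = visible_to_every_quorum(replicas, "item-1")
    return replicas, before, after


if __name__ == "__main__":
    replicas, before, after = simulate()
    print("visible to every quorum before replacement:", before)
    print("visible to every quorum after replacement: ", after)
    print("quorum read hitting B and C:", quorum_read(replicas, ("B", "C"), "item-1"))
```

Before the replacement, any two replicas include A or B, so every quorum read finds the row. After it, only A still has the row, and a quorum read served by B and C comes back empty even though both the write and the read used quorum. Running repair before replacing the node, or keeping the data on EBS so the replacement keeps its data, avoids that state.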
