Hi
We have a Cassandra setup with 2 DCs, where each DC has more than 30 nodes.
From time to time we see problems with read repair, but I am stuck on the
analysis.
The faults follow a pattern, where we do the following:
1. INSERT with LOCAL_QUORUM (2 out of 3)
2. Wait 0.5 - 1 second
3. READ with LOCAL_QUORUM (2 out of 3)
   * This triggers a read repair
4. Then we do an UPDATE …
The replication factor is 3.
As I understand it, after (1) the data is definitely stored on 2 of the 3
replicas, and I would be surprised if it had not also reached the 3rd node
within 0.5 seconds.
So how come the read in (3) can’t get a proper response from 2 out of 3?
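To make the expectation in (1) to (3) concrete, here is a minimal sketch of
the quorum arithmetic as I understand it. It is plain Python, not Cassandra
code: the replica names are made up, and the assumption that exactly one
replica has not yet applied the INSERT is only for illustration.

```python
from itertools import combinations


def quorum(rf: int) -> int:
    """Quorum size for a replication factor: floor(rf / 2) + 1."""
    return rf // 2 + 1


RF = 3
replicas = ["r1", "r2", "r3"]  # hypothetical replica names in DC1
has_write = {"r1", "r2"}       # assumption: r3 has not applied the INSERT yet

# Every pair of replicas that a quorum read could contact.
read_sets = list(combinations(replicas, quorum(RF)))

# A quorum read always overlaps the quorum write in at least one replica.
sees_write = [any(r in has_write for r in rs) for rs in read_sets]

# The contacted replicas disagree whenever the lagging replica is one of them.
mismatch = [rs for rs in read_sets if not set(rs) <= has_write]

print(f"quorum size for RF={RF}: {quorum(RF)}")
print(f"read sets that see the write: {sum(sees_write)} of {len(read_sets)}")
print(f"read sets with disagreeing replicas: {mismatch}")
```

Under these assumptions every read still sees the data, but the replicas
contacted by the read can disagree as long as one replica is behind, which
is the situation where I would expect a read repair.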
Some say the problem started when we added DC2, but I can’t see how that
could be, as our queries use LOCAL_QUORUM and should involve only DC1.
How can I debug this fault?
How can I check whether the data has reached all 3 nodes?
All ideas are welcome.
-Tobias