Hi,

We have a Cassandra solution with 2 DCs, where each DC has more than 30 nodes. From time to time we see problems with READ REPAIR, but I am stuck with the analysis.

We have a pattern for these faults, where we do the following:
1. INSERT with Local Quorum (2 out of 3)
2. Wait for a time window of 0.5 - 1 seconds
3. READ with Local Quorum (2 out of 3)
   * Triggers a read repair
4. Then we do an UPDATE …

The replication factor is 3.

In my world, in (1) we for sure store the data in 2 out of 3 places, and I would be surprised if we did not also reach the 3rd node within 0.5 sec. So how come the read in (3) can’t get a proper response from 2 out of 3?

Some are saying the problem started occurring when we added DC2, but I can’t understand how that could be, as our query is Local Quorum and should involve only DC1.

How can I debug this fault? How can I track whether the data has reached all 3 nodes?

All ideas are welcome.

-Tobias
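
P.S. To make my reasoning about (1) and (3) concrete, here is a minimal sketch of the quorum arithmetic I am assuming. The function names `quorum` and `min_overlap` are my own illustration and not part of any Cassandra API, and the replication factor of 3 is the per-DC value from above. It only shows the arithmetic. It does not explain why the read repair goes wrong.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a (local) quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def min_overlap(replication_factor: int, write_acks: int, read_replies: int) -> int:
    """Smallest possible number of replicas that both acknowledged the write
    and answered the read (pigeonhole argument over RF replicas)."""
    return max(0, write_acks + read_replies - replication_factor)


RF = 3            # replication factor in the local DC (from the post)
W = quorum(RF)    # acks needed for the INSERT in step 1
R = quorum(RF)    # replies needed for the READ in step 3

print(f"quorum for RF={RF}: {W}")
print(f"replicas guaranteed to be in both write and read set: {min_overlap(RF, W, R)}")
```

If this is right, any 2 replicas answering the read in (3) must include at least one that acknowledged the write in (1), even if the 3rd replica has not received it yet. That is why I would expect the read to succeed.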