Greetings!

I'm setting up a new Cassandra cluster with a write-heavy workload (50% writes), and I've run into a strange spiky latency problem. My application metrics showed random latency spikes, which I tracked back to spikes on individual Cassandra nodes. ClientRequest.Latency.Read/Write.p99 occasionally jumps to several seconds on one node at a time, instead of its normal value of around 1000 microseconds. I also noticed that ReadRepair.RepairedBackground.m1_rate goes from zero to a non-zero value (around 1-2/sec) on that node during the spike. I'm lost as to why these spikes are happening and hope someone can give me ideas.

I tried to test whether the ReadRepair metric is causally linked to the latency spikes. However, even after I set dclocal_read_repair_chance to 0 on my tables, and even though the metrics then showed no ReadRepair.Attempted, the ReadRepair.RepairedBackground metric still went up during latency spikes. Am I misunderstanding what this metric tracks? I don't understand why it went up if read repair is turned off.

Some details on my setup: I'm currently running Cassandra 2.2.6 in a dual-datacenter configuration. It's patched so that metrics are recency-biased instead of tracking latency over the entire lifetime of the Java process. I'm using STCS. There is a large amount of data per node, currently about 500GB, and I expect each row to be less than 10KB. The cluster is running on way overpowered hardware: 512GB, RAID 0 on NVMe, and 44 cores across 2 sockets. All of my queries (reads and writes) use LOCAL_ONE, and the replication factor is 3.

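For anyone who wants to reproduce the read repair test described above, the table change could be scripted roughly as follows. This is only a minimal sketch: the helper name, keyspace, and table names are hypothetical placeholders (not the real schema), and it just builds the CQL text so the statements can be reviewed before being run in cqlsh. By default it touches only dclocal_read_repair_chance, the option named in this post; the separate read_repair_chance table option can be passed in as well if it also needs to be ruled out.

```python
def read_repair_off_statements(keyspace, tables,
                               options=("dclocal_read_repair_chance",)):
    """Build one ALTER TABLE statement per table, setting each option to 0.

    The keyspace and table names are supplied by the caller; nothing here
    connects to a cluster or executes the statements.
    """
    settings = " AND ".join(f"{opt} = 0" for opt in options)
    return [f"ALTER TABLE {keyspace}.{table} WITH {settings};"
            for table in tables]


if __name__ == "__main__":
    # Hypothetical keyspace and table names, for illustration only.
    for stmt in read_repair_off_statements("my_keyspace", ["events", "users"]):
        print(stmt)
```

Passing `options=("read_repair_chance", "dclocal_read_repair_chance")` produces a single statement per table that zeroes both chances, which separates the chance-based settings from any other source of the RepairedBackground counts.
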
Thanks, Ted