Troubleshooting random node latency spikes

Ted Pearson Thu, 05 Jan 2017 11:34:57 -0800

Greetings!
I'm working on setting up a new Cassandra cluster with a write-heavy workload 
(50% writes), and I've run into a strange spiky latency problem. My application 
metrics showed random latency spikes, which I tracked back to spikes on 
individual Cassandra nodes. ClientRequest.Latency.Read/Write.p99 occasionally 
jumps to several seconds on one node at a time, instead of its normal value of 
around 1000 microseconds. I also noticed that 
ReadRepair.RepairedBackground.m1_rate goes from zero to a non-zero value 
(around 1-2/sec) on that node during the spike. I'm lost as to why these spikes 
are happening, and I hope someone can give me ideas.
I tried to test whether the ReadRepair metric is causally linked to the latency 
spikes. After I set dclocal_read_repair_chance to 0 on my tables, the metrics 
showed no ReadRepair.Attempted, yet the ReadRepair.RepairedBackground metric 
still went up during latency spikes. Am I misunderstanding what this metric 
tracks? I don't understand why it went up if I turned off read repair.
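For reference, a table-level change of this kind can be sketched in CQL roughly 
as follows. The keyspace and table names (my_keyspace.my_table) are 
placeholders, not the real schema, and the statement has to be repeated for 
each table.

```cql
-- Sketch only: my_keyspace.my_table is a hypothetical placeholder name.
-- Turns off the probabilistic read repair that is limited to the local
-- datacenter for this table.
ALTER TABLE my_keyspace.my_table
  WITH dclocal_read_repair_chance = 0.0;

-- Note: read_repair_chance is a separate table option in 2.2 (it applies
-- across all datacenters) and is not touched by the statement above.
```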
I'm currently running 2.2.6 in a dual-datacenter setup. It's patched to allow 
metrics to be recency-biased instead of tracking latency over the entire 
lifetime of the Java process. I'm using STCS. There is a large amount of data 
per node, about 500GB currently. I expect each row to be less than 10KB. It's 
currently running on way overpowered hardware: 512GB, RAID 0 on NVMe, and 44 
cores on 2 sockets. All of my queries (reads and writes) are LOCAL_ONE and I'm 
using RF=3.


Thanks,
Ted
