This is a 2.8.8 cluster spanning three AWS AZs, with 4 nodes in each. A few days ago we noticed a single node’s read latency reaching 1.5 secs; there were 8 others with read latencies going up near 900 ms.
This single node was a seed node and it was running a ‘repair -pr’ at the time.
We intervened as follows …
• Stopping compactions during repair did not improve latency.
• Killing repair brought latency down to 200 ms on the seed node and the other 8 nodes.
• Restarting C* on the seed node pushed latency back up to near 1.5 secs
on the seed and the other 8 nodes. At this point no repair was running,
only compactions, which we left alone.
At this point we saw that putting the seed node back in the cluster
consistently worsened latencies on the seed and 8 other nodes, i.e. 9 of the
12 nodes in the cluster. So we decided to bootstrap it. During the bootstrap
and afterwards, latencies remained near 200 ms, which is what we wanted for now.
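One plausible mechanism for a single bad node degrading 9 of 12 nodes (this is an assumption on my part, not something we confirmed): every node acts as a coordinator, and with RF=3 on 12 nodes roughly RF/N = 25% of token ranges include any given node as a replica. If a coordinator routes its data read to the 1.5 s node instead of a healthy one, that read inherits the slow node’s latency, so mean latency climbs cluster-wide. A toy model, with all numbers invented for illustration:

```python
# Toy model (assumption, not a diagnosis): node 0 is the degraded replica.
# Assumes the coordinator picks a uniformly random replica for the data
# read (i.e. the dynamic snitch is not routing around the slow node).
import random

random.seed(0)
N, RF = 12, 3            # cluster size and replication factor
FAST_MS, SLOW_MS = 50, 1500  # invented healthy vs. degraded response times
READS = 100_000

def read_latency():
    # Pick the RF replicas for a random token range, then the data replica.
    replicas = random.sample(range(N), RF)
    data_replica = random.choice(replicas)
    return SLOW_MS if data_replica == 0 else FAST_MS

mean = sum(read_latency() for _ in range(READS)) / READS
print(round(mean))  # roughly 50 + (RF/N)*(1/RF)*(1500-50), i.e. ~171 ms
```

So even though only 1/12 of data reads touch the slow node, every coordinator’s average moves, which would look like “9 of 12 nodes got worse” from the outside.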
The only difference we were able to see was that the seed node in question
had about 5000 sstables while all the others had around 2300. After the
bootstrap, the seed node’s sstable count dropped to about 2500.
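For anyone wanting to reproduce the comparison: we summed the per-table “SSTable count” lines from `nodetool tablestats` on each node. A minimal sketch of that tally (the keyspace/table names and counts in the sample are made up; only the order of magnitude mirrors what we saw):

```python
# Sum the 'SSTable count' lines across all tables in one node's
# `nodetool tablestats` output. Sample data below is invented.
import re

def sstable_count(tablestats_output: str) -> int:
    return sum(int(m) for m in re.findall(r"SSTable count:\s*(\d+)", tablestats_output))

sample = """
Keyspace: ks1
    Table: t1
        SSTable count: 3200
    Table: t2
        SSTable count: 1800
"""
print(sstable_count(sample))  # 5000 for this made-up sample
```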
Why would starting C* on a single seed node affect the cluster this badly? Again,
no repair was running, just the 4 compactions that run routinely on it as well
as on all the others. Is it gossip? What other plausible explanations are there?