Btw, the C* version is 2.2.5, with several backported patches.

On Sun, Jan 22, 2017 at 10:36 PM, Dikang Gu <dikan...@gmail.com> wrote:
> Hello there,
>
> We have a cluster of about 100 nodes. I find that there are dropped
> messages on random nodes in the cluster, which cause error spikes and P99
> latency spikes as well.
>
> I tried to figure out the cause. I do not see any obvious bottleneck in
> the cluster; the C* nodes still have plenty of idle CPU and disk I/O
> headroom. But I do see some suspicious gossip events around that time,
> and I am not sure whether they are related.
>
> 2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking nodes down due to local pause of 13079498815 > 5000000000
> 2017-01-21_16:43:56.85532 INFO 16:43:56 [ScheduledTasks:1]: MUTATION messages were dropped in last 5000 ms: 65 for internal timeout and 10895 for cross node timeout
> 2017-01-21_16:43:56.85533 INFO 16:43:56 [ScheduledTasks:1]: READ messages were dropped in last 5000 ms: 33 for internal timeout and 7867 for cross node timeout
> 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: Pool Name      Active  Pending  Completed   Blocked  All Time Blocked
> 2017-01-21_16:43:56.85534 INFO 16:43:56 [ScheduledTasks:1]: MutationStage  128     47794    1015525068  0        0
> 2017-01-21_16:43:56.85535
> 2017-01-21_16:43:56.85535 INFO 16:43:56 [ScheduledTasks:1]: ReadStage      64      20202    450508940   0        0
>
> Any suggestions?
>
> Thanks!
>
> --
> Dikang

--
Dikang
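
To make the numbers in the log excerpt above easier to compare across nodes, here is a rough Python sketch that pulls out the local pause and the dropped-message counts. The function name `summarize` and the regular expressions are hypothetical helpers written for this excerpt, not part of Cassandra. It assumes the two numbers in the "local pause" warning are in nanoseconds, and that the log lines have been rejoined onto single lines as shown.

```python
import re

# Hypothetical patterns, written to match the log lines quoted in this thread.
PAUSE_RE = re.compile(r"local pause of (\d+) > (\d+)")
DROPPED_RE = re.compile(
    r"(\w+) messages were dropped in last (\d+) ms: "
    r"(\d+) for internal timeout and (\d+) for cross node timeout"
)


def summarize(lines):
    """Return (pauses, dropped) extracted from an iterable of log lines.

    pauses  -> list of (pause_seconds, threshold_seconds)
    dropped -> dict of verb -> (internal_timeouts, cross_node_timeouts)
    """
    pauses = []
    dropped = {}
    for line in lines:
        m = PAUSE_RE.search(line)
        if m:
            # Assumption: both values are nanoseconds.
            pauses.append((int(m.group(1)) / 1e9, int(m.group(2)) / 1e9))
            continue
        m = DROPPED_RE.search(line)
        if m:
            verb = m.group(1)
            internal, cross = int(m.group(3)), int(m.group(4))
            prev = dropped.get(verb, (0, 0))
            # Accumulate per verb in case the same verb appears more than once.
            dropped[verb] = (prev[0] + internal, prev[1] + cross)
    return pauses, dropped


SAMPLE = [
    "2017-01-21_16:43:56.71033 WARN 16:43:56 [GossipTasks:1]: Not marking "
    "nodes down due to local pause of 13079498815 > 5000000000",
    "2017-01-21_16:43:56.85532 INFO 16:43:56 [ScheduledTasks:1]: MUTATION "
    "messages were dropped in last 5000 ms: 65 for internal timeout and "
    "10895 for cross node timeout",
    "2017-01-21_16:43:56.85533 INFO 16:43:56 [ScheduledTasks:1]: READ "
    "messages were dropped in last 5000 ms: 33 for internal timeout and "
    "7867 for cross node timeout",
]

pauses, dropped = summarize(SAMPLE)
for pause_s, limit_s in pauses:
    print(f"local pause {pause_s:.2f} s (threshold {limit_s:.2f} s)")
for verb, (internal, cross) in sorted(dropped.items()):
    print(f"{verb}: internal={internal} cross_node={cross}")
```

Under that nanosecond assumption, the warning corresponds to a local pause of about 13 seconds against a 5 second threshold, and in both dropped-message lines the cross node timeouts far outnumber the internal ones. Running the same function over logs from several nodes would show whether the drops line up with such pauses elsewhere.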