Hi,

We have a cluster of 4 nodes, all in one DC (Apache Cassandra version 3.11.6). Things were working fine until last month, when we suddenly started facing intermittent operation timeouts on the client side.
We have Prometheus + Grafana configured for monitoring. On checking, we found the following:

1. Read/write latency at the coordinator level increases at the same time on one or more nodes.
2. jvm_threads_current increases at the same time on one or more nodes.
3. Cassandra hints storage increases at the same time on one or more nodes.
4. Client request and connection timeouts increase, with dropped messages at times.
5. The connected native clients count increases on all nodes.

Things already checked:

1. No change in read/write request volume.
2. No major change in table-level read/write latency.
3. No spike in DB load or CPU utilization.
4. Memory usage is normal.
5. GC pauses are normal.
6. No packet loss between nodes at the network level.

On checking the detailed logs, we found mainly three types of messages during the timeouts:

1. HintsDispatcher: dispatching hints from one node to another.
2. "READ messages were dropped in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 14831 ms"
3. StatusLogger messages.

Please suggest possible reasons for this and action items.

Regards,
Ashish
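P.S. In case it helps anyone reproduce the log check: this is roughly how we tallied the dropped-message reports per hour so they could be lined up against the timeout windows in Grafana. It is only a sketch; the two sample lines below stand in for our real /var/log/cassandra/system.log, and the timestamps are made up for illustration.

```shell
# Two sample log lines standing in for system.log (paths and timestamps
# are illustrative); point the grep at your real log file instead.
cat > /tmp/sample_system.log <<'EOF'
INFO  [ScheduledTasks:1] 2024-05-01 10:00:05,123 MessagingService.java:1281 - READ messages were dropped in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 14831 ms
INFO  [ScheduledTasks:1] 2024-05-01 10:01:05,456 MessagingService.java:1281 - MUTATION messages were dropped in last 5000 ms: 2 internal and 0 cross node. Mean internal dropped latency: 812 ms and Mean cross-node dropped latency: 0 ms
EOF

# Count dropped-message reports per (date, hour): field $3 is the date,
# field $4 the time, so substr($4, 1, 2) keeps just the hour.
grep -h 'messages were dropped in last' /tmp/sample_system.log \
  | awk '{print $3, substr($4, 1, 2)}' \
  | sort | uniq -c
```

With the two sample lines above, this prints a single bucket: `2 2024-05-01 10`.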