Hi,
We have a four-node cluster, all in one DC (Apache Cassandra version 3.11.6).
Things were working fine until last month, when we suddenly started
seeing intermittent operation timeouts on the client side.

We have Prometheus + Grafana configured for monitoring.
On checking, we found the following:
1. Read/write latency at the coordinator level increases at the same time on
one or more nodes.
2. jvm_threads_current increases at the same time on one or more nodes.
3. Cassandra hints storage increases at the same time on one or more nodes.
4. Client request and connection timeouts increase, with dropped messages
at times.
5. The connected native clients count increases on all nodes.
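
Since we already have Prometheus, here is a rough sketch of an alerting rule
we could use to catch the thread spikes earlier. This is an assumption, not
something from our setup: jvm_threads_current is the default metric name from
the Prometheus JVM exporter, and the threshold (2x the one-hour average) is a
guess to be tuned.

```yaml
# Hypothetical Prometheus alerting rule (metric name and threshold are
# assumptions; adjust to whatever the exporter actually exposes).
groups:
  - name: cassandra-timeouts
    rules:
      - alert: CassandraThreadSpike
        expr: jvm_threads_current > 2 * avg_over_time(jvm_threads_current[1h])
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "JVM thread count spiked on {{ $labels.instance }}"
```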

Things already checked:
1. No change in read/write request volume.
2. No major change in table-level read/write latency.
3. No spike in DB load or CPU utilization.
4. Memory usage is normal.
5. GC pauses are normal.
6. No packet loss between nodes at the network level.

On checking the detailed logs, we found mainly three types of messages during
the timeouts:
1. HintsDispatcher: dispatching hints from one node to another.
2. READ messages were dropped in last 5000 ms: 0 internal and 1 cross
node. Mean internal dropped latency: 0 ms and Mean cross-node dropped
latency: 14831 ms
3. StatusLogger messages.
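
For reference, this is a small sketch of how we pull the dropped-message
stats out of system.log for correlation with the Grafana graphs. The regex
is our own assumption based on the 3.11 log line quoted above, not an
official format specification:

```python
import re

# Pattern for Cassandra 3.11's dropped-message log line, e.g.:
# "READ messages were dropped in last 5000 ms: 0 internal and 1 cross node.
#  Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 14831 ms"
DROPPED_RE = re.compile(
    r"(?P<verb>\w+) messages were dropped in last \d+ ms: "
    r"(?P<internal>\d+) internal and (?P<cross>\d+) cross node\. "
    r"Mean internal dropped latency: (?P<int_lat>\d+) ms and "
    r"Mean cross-node dropped latency: (?P<cross_lat>\d+) ms"
)

def parse_dropped(line):
    """Return (verb, internal, cross, internal_lat_ms, cross_lat_ms) or None."""
    m = DROPPED_RE.search(line)
    if not m:
        return None
    return (m.group("verb"), int(m.group("internal")), int(m.group("cross")),
            int(m.group("int_lat")), int(m.group("cross_lat")))

sample = ("READ messages were dropped in last 5000 ms: 0 internal and 1 cross "
          "node. Mean internal dropped latency: 0 ms and Mean cross-node "
          "dropped latency: 14831 ms")
print(parse_dropped(sample))  # ('READ', 0, 1, 0, 14831)
```

Running this over the logs is what showed us the cross-node dropped latency
of 14831 ms quoted above.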

Please suggest possible causes and action items.

Regards,
Ashish
