Hello community,

I recently faced the following problem in a Cassandra cluster. There were two datacenters with 15 Cassandra nodes each, running version 4.1.x. Some of the nodes were gracefully stopped for a couple of hours for administrative purposes. Some time after those nodes were started again, other nodes began reporting long GC pauses. The situation deteriorated over time, and some of the nodes restarted due to OOM. The remaining impacted nodes, which did not restart due to OOM, were administratively restarted, and the system fully recovered.

I suppose some background operation was keeping the impacted nodes busy; the symptom was intensive heap usage and therefore long GC pauses, which caused a major performance hit.

My main question is whether you are aware of any way to identify what a Cassandra node is doing internally, to facilitate troubleshooting of such cases. My ideas so far are to produce and analyze a heap dump of the Cassandra process on a misbehaving node, and to collect and analyze the thread pool statistics exposed over the JMX interface (a rough sketch of the JMX part is appended below my signature).

Do you have similar troubleshooting requirements in your deployments, and if so, what did you do? Are you aware of any articles on this specific topic?

Thank you in advance!
BR MK
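
Appendix: to illustrate what I mean by collecting thread pool statistics over JMX, here is a minimal sketch of the kind of probe I have in mind. I am assuming the default JMX port 7199 with no authentication, and the ThreadPools/PendingTasks MBean names as I understand them from the 4.x metrics layout; the class name PendingTasksProbe is just an example, so please correct me if the object names are off.

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class PendingTasksProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        // Assumed default Cassandra JMX port 7199, no auth -- adjust for your setup.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // Match all thread pool PendingTasks gauges in the metrics domain
            // (MBean name pattern assumed from the 4.x metrics layout).
            ObjectName pattern = new ObjectName(
                    "org.apache.cassandra.metrics:type=ThreadPools,name=PendingTasks,*");
            Set<ObjectName> names = mbs.queryNames(pattern, null);
            for (ObjectName name : names) {
                Object value = mbs.getAttribute(name, "Value");
                System.out.println(name.getKeyProperty("scope") + " pending: " + value);
            }
        }
    }
}

The idea would be to poll this periodically on a misbehaving node and watch which stages accumulate pending tasks while the heap grows.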
