Hello community,
I recently faced the following problem in a Cassandra cluster: two 
datacenters with 15 Cassandra nodes each, running version 4.1.x.
Some of the Cassandra nodes were gracefully stopped for a couple of hours for 
administrative purposes.
Some time after those nodes were brought back up, other Cassandra nodes 
started reporting long GC pauses. The situation deteriorated over time, and 
some of them eventually restarted due to OOM. The remaining impacted nodes, 
which did not hit OOM, were restarted administratively, and the system fully 
recovered.
I suppose that some background operation was keeping the impacted Cassandra 
nodes busy; the symptom was intensive heap usage and, consequently, the long 
GC pauses that caused a major performance hit.
My main question is whether you are aware of any ways to identify what a 
Cassandra node is doing internally, to facilitate troubleshooting of such 
cases. My ideas so far are to produce and analyze a heap dump of the 
Cassandra process on a misbehaving node, and to collect and analyze the 
thread pool statistics exposed over the JMX interface; sketches of both 
follow below. Do you have similar troubleshooting requirements in your 
deployments, and if so, what did you do? Are you aware of any articles on 
this specific topic?
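
For the heap dump I was planning to use the standard JDK tools; a minimal 
sketch, assuming shell access to the node (the PID and output path are 
placeholders):

    # Capture a heap dump of the Cassandra JVM:
    jcmd <cassandra-pid> GC.heap_dump /tmp/cassandra-heap.hprof

    # Alternative via jmap; note that -dump:live triggers a full GC first,
    # which adds pause time on an already struggling node:
    jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof <cassandra-pid>

    # Quick look at the thread pools before going deeper:
    nodetool tpstats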
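
For polling the thread pool statistics over time, a minimal JMX client 
sketch in Java; it assumes the default unauthenticated JMX endpoint on port 
7199, and the choice of the PendingTasks gauge is mine (other gauges such as 
ActiveTasks follow the same pattern):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import java.util.Set;

    public class TpStatsProbe {
        public static void main(String[] args) throws Exception {
            // Host is a placeholder; 7199 is Cassandra's default JMX port.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Match the PendingTasks gauge of every thread pool.
                ObjectName pattern = new ObjectName(
                    "org.apache.cassandra.metrics:type=ThreadPools,path=*,scope=*,name=PendingTasks");
                Set<ObjectName> pools = mbs.queryNames(pattern, null);
                for (ObjectName pool : pools) {
                    // Gauges expose their current value via the "Value" attribute.
                    Object pending = mbs.getAttribute(pool, "Value");
                    System.out.printf("%s/%s pending=%s%n",
                            pool.getKeyProperty("path"),
                            pool.getKeyProperty("scope"), pending);
                }
            }
        }
    }

nodetool tpstats reports the same counters in aggregate; the JMX route makes 
it easy to log them periodically so they can be correlated afterwards with 
the GC pauses.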
Thank you in advance!

BR
MK
