Hello community,
Thanks a lot for your replies. I will start by collecting statistics, which is 
less intrusive. The async profiler seems useful; I will try it on my lab 
nodes.
If my troubleshooting turns up any findings worth sharing with the Cassandra 
users community, I will come back with a post.

BR
MK

From: Jon Haddad <[email protected]>
Sent: December 01, 2025 23:46
To: [email protected]
Cc: Michalis Kotsiouros (EXT) <[email protected]>; Elliott 
Sims <[email protected]>
Subject: Re: Troubleshooting internal Cassandra operations.

+1 to a heap dump, but with some caveats.  They're great if you know what 
you're looking for and are familiar with the tooling around them, but in the 
past I've run into large heap dumps that were effectively unusable because 
every tool I tried would either lock up or crash.
These days I usually reach for the async-profiler.  If you want to know what's 
being allocated in a given window of time, use the `-e alloc` mode and you can 
find out pretty quickly where your allocations are coming from.  
CASSANDRA-20428 [1] is a good example, where I found that compaction had a 
single call generating 40% of allocations.
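For example, something along these lines (the profiler.sh path, the output 
file, and the PID are placeholders for your own install):

    # 60-second allocation profile of the Cassandra JVM, written as a flame graph
    ./profiler.sh -e alloc -d 60 -f /tmp/cassandra-alloc.html <cassandra-pid>

The widest frames in the resulting flame graph are the call sites responsible 
for the most allocated bytes.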

It wouldn't surprise me if there were a ton of hints sent over (since the nodes 
were down for hours), then lots of pressure from unthrottled compaction and/or 
a small heap or small new gen caused the old gen to get flooded with objects.  
Just a guess, there's not much to go on in the original question.
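If you want to check that theory on an affected node, something like the 
following should do (nodetool subcommands as shipped in 4.1; the throttle 
values here are purely illustrative):

    nodetool tpstats                          # look at the hints and compaction pools
    nodetool compactionstats                  # pending compactions
    nodetool setcompactionthroughput 16       # re-apply a compaction throttle (MB/s)
    nodetool sethintedhandoffthrottlekb 1024  # throttle hint delivery (KB/s)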

Jon

[1] https://issues.apache.org/jira/browse/CASSANDRA-20428

On Mon, Dec 1, 2025 at 1:04 PM Elliott Sims via user 
<[email protected]> wrote:
A heap dump is a good start.  You can also turn on detailed GC logging and look 
through that.  I definitely find it useful to check "heap size after full GC" 
(via jconsole, collected metrics, GC logging, or tools like jstat or nodetool 
gcstats) and heap allocation rate to figure out if it's a problem of "heap too 
small for live data-set" vs "GC can't keep up".  "nodetool sjk ttop -o ALLOC" 
can give you a good idea of both allocation rate and what's doing the 
allocating.
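Concretely, that could look like this (the PID is a placeholder; nodetool sjk 
is bundled with Cassandra 4.x):

    jstat -gcutil <cassandra-pid> 5000   # per-generation heap occupancy, sampled every 5s
    nodetool gcstats                     # GC pause summary since the last call
    nodetool sjk ttop -o ALLOC -n 20     # top 20 threads by allocation rate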

There are lots of commercial tools, but Eclipse MAT's heap analyzer does a 
decent job of finding the major heap space consumers.  It requires jumping 
through some extra hoops for heaps that are large relative to local memory, 
though.
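One way through those hoops is MAT's headless mode (ParseHeapDump.sh ships 
with MAT; raise its heap in MemoryAnalyzer.ini if needed):

    # dump only live objects to keep the file smaller
    jmap -dump:live,format=b,file=/tmp/cassandra.hprof <cassandra-pid>

    # parse without the GUI; writes a leak-suspects report next to the dump
    ./ParseHeapDump.sh /tmp/cassandra.hprof org.eclipse.mat.api:suspects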

On Fri, Nov 28, 2025 at 3:01 AM Michalis Kotsiouros (EXT) via user 
<[email protected]> wrote:
Hello community,
I recently faced the following problem in a Cassandra cluster. There were 
2 datacenters with 15 Cassandra nodes each, on version 4.1.x.
Some of the Cassandra nodes were gracefully stopped for a couple of hours for 
administrative purposes.
Some time after those Cassandra nodes were started again, other Cassandra 
nodes started reporting long GC pauses. The situation deteriorated over time, 
resulting in some of them restarting due to OOM. The rest of the impacted 
Cassandra nodes, which did not restart due to OOM, were administratively 
restarted, and the system fully recovered.
I suppose that some background operation was keeping the impacted Cassandra 
nodes busy; the symptom was intensive use of heap memory and thus the long GC 
pauses, which caused a major performance hit.
My main question is whether you are aware of any ways to identify what a 
Cassandra node is doing internally, to facilitate troubleshooting of such 
cases. My ideas so far are to produce and analyze a heap dump of the Cassandra 
process of a misbehaving node, and to collect and analyze the Thread Pool 
statistics provided by the JMX interface. Do you have similar troubleshooting 
requirements in your deployments, and if so, what did you do? Are you aware of 
any article on this specific topic?
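For the Thread Pool statistics, a minimal sketch would be periodic sampling 
(nodetool tpstats reads the same figures that JMX exposes under 
org.apache.cassandra.metrics):

    # sample thread pool stats (active/pending/blocked per stage) every minute
    while true; do date; nodetool tpstats; sleep 60; done >> /tmp/tpstats.log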
Thank you in advance!

BR
MK

