Hello community,

Thanks a lot for your replies. I will start with collecting statistics, which is less intrusive. The async-profiler seems useful; I will try it on my lab nodes. If my troubleshooting produces any finding worth sharing with the Cassandra users community, I will come back with a post.
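For the record, the first commands I plan to run look roughly like this. This is only a sketch: the async-profiler install path, the 60-second window, and the <cassandra_pid> placeholder are assumptions for my lab setup.

  # Thread pool and GC statistics (low overhead, sampled periodically)
  nodetool tpstats
  nodetool gcstats
  jstat -gcutil <cassandra_pid> 10000   # heap occupancy and GC activity every 10 s

  # Allocation view, as suggested in this thread
  nodetool sjk ttop -o ALLOC
  /opt/async-profiler/profiler.sh -e alloc -d 60 -f /tmp/alloc-flamegraph.html <cassandra_pid>

The idea is to compare the allocation flame graph of a healthy node against one that is accumulating heap, to see which internal operation dominates the allocations.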
BR
MK

From: Jon Haddad <[email protected]>
Sent: December 01, 2025 23:46
To: [email protected]
Cc: Michalis Kotsiouros (EXT) <[email protected]>; Elliott Sims <[email protected]>
Subject: Re: Troubleshooting internal Cassandra operations.

+1 to a heap dump, but with some caveats. They're great if you know what you're looking for and are familiar with the tooling around them, but I've run into issues with large heap dumps in the past that were effectively unusable because every tool I tried would either lock up or crash.

These days I usually reach for the async-profiler. If you want to know what is being allocated in any given window of time, use the `-e alloc` mode and you can find out pretty quickly where your allocations are coming from. CASSANDRA-20428 [1] is a good example, where I found that compaction has a single call generating 40% of allocations.

It wouldn't surprise me if a ton of hints were sent over (since the nodes were down for hours), and then pressure from unthrottled compaction and/or a small heap or small new gen caused the old gen to get flooded with objects. Just a guess; there's not much to go on in the original question.

Jon

[1] https://issues.apache.org/jira/browse/CASSANDRA-20428

On Mon, Dec 1, 2025 at 1:04 PM Elliott Sims via user <[email protected]> wrote:

A heap dump is a good start. You can also turn on detailed GC logging and look through that. I definitely find it useful to check "heap size after full GC" (via jconsole, collected metrics, GC logging, or tools like jstat or nodetool gcstats) and the heap allocation rate, to figure out whether the problem is "heap too small for the live data set" or "GC can't keep up". "nodetool sjk ttop -o ALLOC" can give you a good idea of both the allocation rate and what is doing the allocating.

There are lots of commercial tools, but Eclipse MAT's heap analyzer does a decent job of finding the major heap space consumers. It requires jumping through some extra hoops for heaps that are large relative to local memory, though.

On Fri, Nov 28, 2025 at 3:01 AM Michalis Kotsiouros (EXT) via user <[email protected]> wrote:

Hello community,

I have recently faced the following problem in a Cassandra cluster. There were two datacenters with 15 Cassandra nodes each, on version 4.1.x. Some of the Cassandra nodes were gracefully stopped for a couple of hours for administrative purposes. Some time after those nodes were started again, other Cassandra nodes started reporting long GC pauses. The situation deteriorated over time, resulting in some of them restarting due to OOM. The rest of the impacted nodes, which did not restart due to OOM, were administratively restarted and the system fully recovered.

I suppose that some background operation was keeping the impacted Cassandra nodes busy, and the symptom was intensive use of heap memory and thus the long GC pauses, which caused a major performance hit.

My main question is whether you are aware of any way to identify what a Cassandra node is doing internally, to facilitate troubleshooting of such cases. My ideas so far are to produce and analyze a heap dump of the Cassandra process on a misbehaving node, and to collect and analyze the Thread Pool statistics provided by the JMX interface. Do you have similar troubleshooting requirements in your deployments, and if so, what did you do? Are you aware of any article on this specific topic?

Thank you in advance!
BR
MK
