The young GCs look suspicious.

Without seeing the heap it’s hard to be sure, but make sure you’re tuning your
memtable settings (size and flush threshold); you may find that moving
memtables off-heap helps.

Beyond that, debugging usually takes a heap dump and inspection with YourKit,
MAT, or similar. Your version reads like a DataStax version. I know there are
a few reports of recyclers not working great in 3.11.x, but I haven’t seen
many heap-related leak concerns with 3.0.14.
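
Since you can collect the GC logs, a small script may help put numbers on
them. Here is a minimal Python sketch that breaks a GC warning line of the
kind quoted below into the pause time and the bytes freed per pool, and
converts the "local pause" values from nanoseconds to seconds. The function
names and regular expressions are my own assumptions and only match the log
format exactly as it appears in your message; adjust them if your lines
differ.

```python
import re

# Assumed format: pool transitions such as "G1 Eden Space: 1535115264 -> 0".
POOL_RE = re.compile(r"(G1 [A-Za-z ]+?): (\d+) -> (\d+)")
# Assumed format: pause duration such as "GC in 3543ms".
PAUSE_RE = re.compile(r"GC in (\d+)ms")


def summarize_gc_line(line):
    """Return (pause_ms, {pool: bytes_freed}); a negative value means the pool grew."""
    pause = PAUSE_RE.search(line)
    freed = {
        name: int(before) - int(after)
        for name, before, after in POOL_RE.findall(line)
    }
    return (int(pause.group(1)) if pause else None), freed


def ns_to_seconds(nanos):
    """Convert a nanosecond value (as in the 'local pause' warning) to seconds."""
    return nanos / 1e9


# The GC warning line from the quoted message.
line = (
    "G1 Young Generation GC in 3543ms.  G1 Eden Space: 1535115264 -> 0; "
    "G1 Old Gen: 14851011568 -> 14585937368; "
    "G1 Survivor Space: 58720256 -> 83886080;"
)
pause_ms, freed = summarize_gc_line(line)

print(f"pause: {pause_ms} ms")
for pool, delta in freed.items():
    print(f"{pool}: {delta / 2**20:.1f} MiB freed")

# The 'local pause' warning compares two nanosecond values.
print(f"local pause: {ns_to_seconds(5169439644):.3f} s "
      f"(threshold {ns_to_seconds(5000000000):.0f} s)")
```

Running this over the whole log and plotting pause time against bytes freed
per collection should show whether young collections really get slower while
reclaiming less during the spikes.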

Jeff Jirsa

> On Jun 30, 2018, at 5:49 PM, Tunay Gür <> wrote:
> Dear Cassandra users, 
> I'm observing high coordinator latencies (spikes going over 1 sec for P99)
> without corresponding keyspace read latencies. After researching this list
> and the public web, I focused my investigation on GC, but still couldn't
> convince myself 100% (mainly because of my lack of experience with JVM GC
> and Cassandra behavior). I'd appreciate it if you could help me out.
> Setup:
> - 2 DCs, 40 nodes each
> - Cassandra Version:
> - G1GC
> - -Xms30500M -Xmx30500M
> - Traffic mix:  20K continuous RPS  + 10K continuous WPS + 40K WPS daily bulk 
> ingestion (for 2 hours)
> - Row cache disabled, Keycache 5GB capacity
> Some observations:
> - I don't have clear repro steps, but I feel like high coordinator latencies
> get triggered by some sudden change in traffic (e.g. bulk ingestion or DC
> failover). For example, the last time it happened, bulk ingestion triggered
> it and coordinator latencies kept spiraling up until I drained some of the
> traffic:
> <Screen Shot 2018-06-30 at 5.15.31 PM.png>
> - I see a corresponding increase in GC warning logs that look similar to this:
> G1 Young Generation GC in 3543ms.  G1 Eden Space: 1535115264 -> 0; G1 Old 
> Gen: 14851011568 -> 14585937368; G1 Survivor Space: 58720256 -> 83886080; 
> - Also I see the following warnings every once in a while: 
> Not marking nodes down due to local pause of 5169439644 > 5000000000
> - It looks like the cluster goes into this state after a while, maybe after
> 10 days or so. Restarting the cluster helps. When things are working, I've
> seen this cluster handle 1M RPS without a problem.
> - I don't have root access on the machines, but I can collect GC logs. I'm
> not sure I'm interpreting them correctly, but one observation is that many
> more young-gen GCs happen, with less memory reclaimed, during latency spikes.
