Thanks for the recommendation, Jeff. I'll try to get a heap dump the next time this happens, and try the other changes in the meantime.
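For the heap dump I'll probably use jmap against the Cassandra pid (a minimal sketch, assuming a Java 8 JVM and that I can run it as the same user as the Cassandra process; jcmd should work too):

    # find the Cassandra pid, then dump the live objects to a binary hprof file
    # (note: -dump:live triggers a full GC before taking the dump)
    jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof $(pgrep -f CassandraDaemon)

    # alternative via jcmd
    jcmd $(pgrep -f CassandraDaemon) GC.heap_dump /tmp/cassandra-heap.hprof

Then I can load the .hprof into MAT or YourKit as you suggested.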
Also, I'm not sure, but CASSANDRA-13900 looks like it might be related.

On Sat, Jun 30, 2018 at 9:51 PM, Jeff Jirsa <jji...@gmail.com> wrote:

> The young GCs look suspicious.
>
> Without seeing the heap it's hard to be sure, but be sure you're adjusting
> your memtable (size and flush threshold), and you may find moving it
> offheap helps too.
>
> Beyond that, debugging usually takes a heap dump and inspection with
> YourKit or MAT or similar.
>
> 3.0.14.10 reads like a DataStax version - I know there are a few reports
> of recyclers not working great in 3.11.x, but I haven't seen many
> heap-related leak concerns with 3.0.14.
>
> --
> Jeff Jirsa
>
>
> On Jun 30, 2018, at 5:49 PM, Tunay Gür <tunay...@gmail.com> wrote:
>
> Dear Cassandra users,
>
> I'm observing high coordinator latencies (spikes going over 1 sec at P99)
> without corresponding keyspace read latencies. After researching this list
> and the public web, I focused my investigation on GC, but I still couldn't
> convince myself 100% (mainly because of my lack of experience with JVM GC
> and Cassandra behavior). I'd appreciate it if you could help me out.
>
> *Setup:*
> - 2 DCs, 40 nodes each
> - Cassandra version: 3.0.14.10
> - G1GC
> - -Xms30500M -Xmx30500M
> - Traffic mix: 20K continuous RPS + 10K continuous WPS + 40K WPS daily
>   bulk ingestion (for 2 hours)
> - Row cache disabled, key cache 5 GB capacity
>
> *Some observations:*
> - I don't have clear repro steps, but I feel like the high coordinator
>   latencies get triggered by some sudden change in traffic (i.e. bulk
>   ingestion or DC failover). For example, the last time it happened, bulk
>   ingestion triggered it and coordinator latencies kept spiraling up until
>   I drained some of the traffic:
>
> <Screen Shot 2018-06-30 at 5.15.31 PM.png>
>
> - I see a corresponding increase in GC warning logs that look similar to
>   this:
>
>   G1 Young Generation GC in 3543ms. G1 Eden Space: 1535115264 -> 0; G1 Old
>   Gen: 14851011568 -> 14585937368; G1 Survivor Space: 58720256 -> 83886080;
>
> - I also see the following warning every once in a while:
>
>   Not marking nodes down due to local pause of 5169439644 > 5000000000
>
> - It looks like the cluster goes into this state after a while, maybe
>   after 10 days or so. Restarting the cluster helps. When things are
>   working, I've seen this cluster handle 1M RPS without a problem.
>
> - I don't have root access on the machines, but I can collect GC logs. I'm
>   not sure I'm interpreting them correctly, but one observation is that a
>   lot more young-gen GC happens, with less memory reclaimed, during the
>   latency spikes.
>
> Anything else I can do to conclude whether this is GC related or not?
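PS: For reference, these are the memtable knobs I'm planning to experiment with per your suggestion (a sketch of the relevant cassandra.yaml settings, assuming the 3.0-era option names; the values are illustrative placeholders, not tuned recommendations, and I'm not 100% sure offheap allocation is available on a 3.0.x build):

    # cassandra.yaml (values below are placeholders)
    memtable_heap_space_in_mb: 2048        # cap for on-heap memtable space
    memtable_offheap_space_in_mb: 4096     # cap for off-heap memtable space
    memtable_cleanup_threshold: 0.15       # flush when this fraction of memtable space is in use
    memtable_allocation_type: offheap_objects   # heap_buffers | offheap_buffers | offheap_objects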