Thanks for the recommendation Jeff, I'll try to get a heap dump the next time
this happens and try the other changes (memtable tuning, sketched below) in
the meantime.
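
For my own notes, this is roughly how I plan to grab the dump for MAT or
YourKit; it assumes I can run jmap as the Cassandra user and that pgrep can
find the daemon by class name, so adjust for our setup:

  # find the Cassandra pid (assumes the CassandraDaemon class name)
  CASS_PID=$(pgrep -f CassandraDaemon)
  # live-object heap dump in HPROF format, for loading into MAT or YourKit
  jmap -dump:live,format=b,file=/tmp/cassandra-heap.hprof "$CASS_PID"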

Also, I'm not sure, but CASSANDRA-13900 looks like it might be related.
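
On the memtable changes: this is roughly what I'm planning to try first in
cassandra.yaml (the values are guesses for our 30 GB heap, not
recommendations, so please correct me if they're off):

  # move memtable data off heap to take pressure off G1
  # (if off-heap allocation isn't supported on our 3.0 build, I'll fall
  # back to heap_buffers and just cap the sizes)
  memtable_allocation_type: offheap_buffers
  # cap memtable space explicitly instead of relying on the 1/4-heap default
  memtable_heap_space_in_mb: 2048
  memtable_offheap_space_in_mb: 4096
  # flush a bit earlier so a single large memtable doesn't linger in old gen
  memtable_cleanup_threshold: 0.2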

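Also, for what it's worth, this is roughly how I've been pulling the
young-gen pause times out of the logs (log path assumed; these are the
GCInspector "G1 Young Generation GC in ...ms" lines like the one quoted
below):

  # list the five longest young-gen pauses reported in system.log
  grep -o 'G1 Young Generation GC in [0-9]*ms' /var/log/cassandra/system.log \
    | awk '{print $6}' | sort -n | tail -5
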
On Sat, Jun 30, 2018 at 9:51 PM, Jeff Jirsa <jji...@gmail.com> wrote:

> The young GCs look suspicious
>
> Without seeing the heap it’s hard to be sure, but make sure you’re adjusting
> your memtable settings (size and flush threshold); you may also find that
> moving it off-heap helps
>
> Beyond that, debugging usually takes a heap dump and inspection with
> YourKit, MAT, or a similar tool
>
> 3.0.14.10 reads like a DataStax version - I know there are a few reports of
> recyclers not working well in 3.11.x, but I haven’t seen many heap-related
> leak concerns with 3.0.14
>
> --
> Jeff Jirsa
>
>
> On Jun 30, 2018, at 5:49 PM, Tunay Gür <tunay...@gmail.com> wrote:
>
> Dear Cassandra users,
>
> I'm observing high coordinator latencies (spikes going over 1 sec at P99)
> without corresponding keyspace read latencies. After researching this list
> and the public web, I focused my investigation on GC, but I still couldn't
> convince myself 100% (mainly because of my lack of experience with JVM GC
> and Cassandra behavior). I'd appreciate it if you could help me out.
>
> *Setup:*
> - 2DC 40 nodes each
> - Cassandra Version: 3.0.14.10
> - G1GC
> - -Xms30500M -Xmx30500M
> - Traffic mix: 20K continuous RPS + 10K continuous WPS + 40K WPS daily
> bulk ingestion (for 2 hours)
> - Row cache disabled, key cache 5 GB capacity
>
> *Some observations:*
> - I don't have clear repro steps, but it feels like the high coordinator
> latencies get triggered by a sudden change in traffic (e.g. bulk
> ingestion or DC failover). For example, the last time it happened, bulk
> ingestion triggered it and coordinator latencies kept spiraling up until I
> drained some of the traffic:
>
> <Screen Shot 2018-06-30 at 5.15.31 PM.png>
>
> - I see a corresponding increase in GC warning logs that look similar to
> this:
>
> G1 Young Generation GC in 3543ms. G1 Eden Space: 1535115264 -> 0; G1 Old
> Gen: 14851011568 -> 14585937368; G1 Survivor Space: 58720256 -> 83886080;
> - I also see the following warning every once in a while:
>
> Not marking nodes down due to local pause of 5169439644 > 5000000000
>
> - It looks like the cluster goes into this state after a while, maybe after
> 10 days or so. Restarting the cluster helps. When things are working, I've
> seen this cluster handle 1M RPS without a problem.
>
> - I don't have root access on the machines, but I can collect GC logs. I'm
> not sure I'm interpreting them correctly, but one observation is that a lot
> more young-gen GC happens, with less memory reclaimed, during the latency
> spikes.
>
> Is there anything else I can do to conclude whether this is GC-related or
> not?
>
