I definitely saw a noticeable decrease in GC activity somewhere between
3.11.0 and 3.11.4.  I'm not sure which change did it, but I can't think of
any good reason to stay on 3.11.0 rather than upgrading to 3.11.6.

I would enable and look through GC logs (or just the slow-GC entries in the
default log) to see whether the problem is that it's actually running out of
heap vs falling behind on GC.  For example, if it's doing long mixed or full
GCs and the old-gen space isn't shrinking much, it's probably just too much
total data.  If it's just falling behind, there are settings like
InitiatingHeapOccupancyPercent you can tune.
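
As a starting point, something like this in jvm.options would give you the
data (this assumes Java 8 flag syntax and a writable /var/log/cassandra, so
adjust the path; treat the IHOP value as an illustrative example only, since
the G1 default is 45 and the right number depends on your workload):

    # GC logging (Java 8 style flags)
    -Xloggc:/var/log/cassandra/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintGCApplicationStoppedTime
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=10
    -XX:GCLogFileSize=10M

    # Start G1's concurrent marking cycle earlier if it keeps falling behind
    # (lower value = mixed collections kick in sooner)
    -XX:InitiatingHeapOccupancyPercent=35

With that in place, compare heap occupancy after each mixed/full collection
over time: if it keeps creeping up, there's too much live data; if it drops
back down, G1 is just starting its work too late.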

It might also be worth looking at "ttop" from
https://github.com/aragozin/jvm-tools and sorting by heap allocation to see
if you can identify top offenders.
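
Roughly like this (going from memory on the exact sjk option names, so check
its built-in help; sjk.jar and the CassandraDaemon pgrep pattern are just
placeholders for however you locate the jar and the Cassandra PID):

    # Per-thread live view from jvm-tools, sorted by allocation rate
    java -jar sjk.jar ttop -p $(pgrep -f CassandraDaemon) -o ALLOC -n 20

Threads that stay near the top of that list are a good hint as to which part
of the workload (reads, compaction, repair streaming, and so on) is
generating most of the garbage.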

On Thu, Feb 27, 2020 at 9:59 AM Krish Donald <gotomyp...@gmail.com> wrote:

> Thanks everyone for the responses.
> How can I debug the GC issue further?
> Is there a known GC issue in 3.11.0?
>
> On Thu, Feb 27, 2020 at 8:46 AM Reid Pinchback <rpinchb...@tripadvisor.com>
> wrote:
>
>> Our experience with G1GC was that 31 GB wasn’t optimal (for us): full GCs
>> are less frequent, but they are bigger when they do happen.  Even so, that
>> shouldn't reach the point of a 9.5-second full collection.
>>
>>
>>
>> Unless it is a rare event associated with something weird happening
>> outside of the JVM (there are some wacky interactions between memory and
>> dirty-page writeback that could cause it, but not typically), that is
>> evidence of a really tough fight to reclaim memory.  A lot of things can
>> impact garbage collection performance.  Something is either being pushed
>> very hard, or something is being constrained very tightly compared to
>> resource demand.
>>
>>
>>
>> I’m with Erick: I wouldn’t put my attention on anything but the GC issue
>> right now. Everything else that happens within the JVM envelope is going
>> to be a misread on timing until you have stable garbage collection. You
>> might have other issues later, but you aren’t going to know what those
>> are yet.
>>
>>
>>
>> One thing you could at least try to eliminate quickly as a factor: are
>> repairs running at the times when things are slow?  Prior to 3.11.5 you
>> lack one of the tuning knobs for trading off memory against network
>> bandwidth during repairs.
>>
>>
>>
>> I’d also make sure you have tuned C* to migrate whatever you reasonably
>> can to be off-heap.
>>
>>
>>
>> Another thought for surprise demands on memory.  I don’t know if this is
>> handled in 3.11.0; you’ll have to check the C* bash scripts that launch
>> the service.  The number of malloc arenas hasn’t always been capped, and
>> that can result in an explosion in memory demand.  I just don’t recall
>> where in the C* version history that was addressed.
>>
>>
>>
>>
>>
>> From: Erick Ramirez <erick.rami...@datastax.com>
>> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Date: Wednesday, February 26, 2020 at 9:55 PM
>> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
>> Subject: Re: Hints replays very slow in one DC
>>
>>
>>
>> Nodes are going down due to Out of Memory, and we are using a 31 GB heap
>> in DC1, whereas DC2 (which serves the traffic) has a 16 GB heap.
>>
>> We had to increase the heap in DC1 because DC1 nodes were going down with
>> Out of Memory errors, but DC2 nodes never went down.
>>
>>
>>
>> It doesn't sound right that the primary DC is DC2 but DC1 is the one under
>> load. You might not be aware of it, but the symptom suggests DC1 is getting
>> hit with lots of traffic. If you run netstat (or whichever utility you
>> prefer), you should see the established connections to the cluster. That
>> should give you clues as to where the traffic is coming from.
>>
>>
>>
>> We also noticed messages like the following in system.log:
>>
>> FailureDetector.java:288 - Not marking nodes down due to local pause of
>> 9532654114 > 5000000000
>>
>>
>>
>> That's another smoking gun that the nodes are buried in GC. A 9.5-second
>> pause is significant. The slow hinted handoffs are really the least of your
>> problems right now. If nodes weren't going down, there wouldn't be hints to
>> hand off in the first place. Cheers!
>>
>>
>>
>> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax
>> have answers! Share your expertise on https://community.datastax.com/.
>>
>
