I should note that the young gen size is just a tuning suggestion, not directly 
related to your problem at hand.

You might want to make sure you don’t have issues with key/row cache.
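
A quick way to spot-check that (output format varies a bit by Cassandra 
version) is:

  nodetool info    # look at the Key Cache / Row Cache lines: size, capacity, hit rate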

Also, I’m assuming that the extra load isn’t hitting tables you wouldn’t 
normally be touching.
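
One rough way to verify that is to compare per-table activity against a 
healthy node (the keyspace name below is just a placeholder):

  nodetool cfstats <your_keyspace>    # read/write counts and latencies per table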

> On Nov 28, 2014, at 6:54 PM, graham sanderson <gra...@vast.com> wrote:
> 
> Your GC settings would be helpful, though one can guesstimate them by 
> eyeballing (assuming the settings are the same across all 4 images).
> 
> Bursty load can be a big cause of old gen fragmentation, as small working-set 
> objects tend to get spilled (promoted) along with memtable slabs that aren’t 
> flushed quickly enough. That said, empty fragmentation holes wouldn’t show up 
> as “used” in your graph, and it clearly looks like you are above your 
> CMSInitiatingOccupancyFraction and CMS is running continuously, so 
> fragmentation probably isn’t the issue here.
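> 
> (To confirm or rule that out, detailed GC logging is the easiest way to see 
> promotion behavior. Purely as an illustration, on a HotSpot 7 JVM, adding 
> something like the following to JVM_OPTS logs tenuring distribution and 
> promotion failures; the log path is just an example:
> 
>   -Xloggc:/var/log/cassandra/gc.log
>   -XX:+PrintGCDetails -XX:+PrintGCDateStamps
>   -XX:+PrintTenuringDistribution
>   -XX:+PrintPromotionFailure
> )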
> 
> Other than trying a slightly larger heap to give you more headroom, it looks 
> (again from eyeballing) like you have probably let the JVM pick its own new 
> gen size, and I suspect it is too small. What to set it to really depends on 
> your workload, but you could try something in the 0.5 GB range unless that 
> makes your young gen pauses too long. In that case (or indeed anyway), make 
> sure you also have the latest GC settings on newer JVMs (e.g. 
> -XX:+CMSParallelInitialMarkEnabled -XX:+CMSEdenChunksRecordAlways) to help 
> the young gen pauses.
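> 
> For illustration only (the numbers are placeholders and depend entirely on 
> your heap and workload), in cassandra-env.sh that would look roughly like:
> 
>   MAX_HEAP_SIZE="8G"      # whatever you already run
>   HEAP_NEWSIZE="512M"     # explicit new gen instead of letting the JVM pick
>   JVM_OPTS="$JVM_OPTS -XX:+CMSParallelInitialMarkEnabled"
>   JVM_OPTS="$JVM_OPTS -XX:+CMSEdenChunksRecordAlways"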
> 
>> On Nov 28, 2014, at 2:55 PM, Paulo Ricardo Motta Gomes 
>> <paulo.mo...@chaordicsystems.com> wrote:
>> 
>> Hello,
>> 
>> This is a recurrent behavior of the JVM GC in Cassandra that I have never 
>> completely understood: when a node is UP for many days (or even months), or 
>> receives a very high load spike (3x-5x the normal load), CMS GC pauses 
>> become very frequent and long, causing periodic timeouts in Cassandra. 
>> Triggering a GC manually doesn't free up memory. The only solution when a 
>> node reaches this state is to restart it.
>> 
>> We restart the whole cluster every 1 or 2 months to avoid machines getting 
>> into this crazy state. We have tried tuning the heap size and GC parameters 
>> and different Cassandra versions (1.1, 1.2, 2.0), but this behavior keeps 
>> happening. More recently, during Black Friday, we received about 5x our 
>> normal load and some machines started presenting this behavior. Once again, 
>> we restarted the nodes and the GC behaved normally again.
>> 
>> I'm attaching a few pictures comparing the heap of "healthy" and "sick" 
>> nodes: http://imgur.com/a/Tcr3w
>> 
>> You can clearly see that some memory is actually reclaimed during GC on the 
>> healthy nodes, while on the sick machines very little memory is reclaimed. 
>> Also, since GC runs more frequently on the sick machines, they use about 2x 
>> more CPU than the healthy nodes.
>> 
>> Have you ever observed this behavior in your cluster? Could this be related 
>> to heap fragmentation? Would using the G1 collector help in this case? Any 
>> GC tuning or monitoring advice to troubleshoot this issue?
>> 
>> Any advice or pointers will be kindly appreciated.
>> 
>> Cheers,
>> 
>> -- 
>> Paulo Motta
>> 
>> Chaordic | Platform
>> www.chaordic.com.br
>> +55 48 3232.3200
> 
