RE: Cassandra and G1 Garbage collector stop the world event (STW)

2017-10-09 Thread Steinmaurer, Thomas
Hi,

My previously mentioned G1 bug does not seem to be related to your case.

Thomas

From: Gustavo Scudeler [mailto:scudel...@gmail.com]
Sent: Monday, 09 October 2017 15:13
To: user@cassandra.apache.org
Subject: Re: Cassandra and G1 Garbage collector stop the world event (STW)

Hello,

@kurt greaves: Have you tried CMS with that sized heap?

Yes, for testing purposes, I have 3 nodes with CMS and 3 with G1.
The behavior is basically the same.


Using CMS suggested settings 
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=

Using G1 suggested settings 
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3


@Steinmaurer, Thomas: If this happens very frequently within a very short 
time, then depending on your allocation rate in MB/s, the combination of the 
G1 bug and a small heap might lead towards an OOM.

We have a really high object allocation rate:

Avg creation rate   622.9 MB/sec
Avg promotion rate  18.39 MB/sec


This could be the cause: the GC may simply be unable to keep up with this 
rate.

I'm starting to think it could be a misconfiguration, where Cassandra is 
set up in a way that bursts allocations in a manner that G1 can't keep up 
with.
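
To put those two rates in perspective, here is a rough back-of-the-envelope
sketch. It is not a measurement: the young generation bounds of 5% and 60% of
the heap are assumed from G1's default settings, and the class and method
names are made up for illustration.

```java
import java.util.Locale;

class AllocationRateSketch {
    /** Seconds until a space of the given capacity fills at a constant rate. */
    static double secondsToFill(double capacityMb, double rateMbPerSec) {
        if (rateMbPerSec <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        return capacityMb / rateMbPerSec;
    }

    public static void main(String[] args) {
        double heapMb = 12 * 1024;         // 12 GB heap, from the thread
        double creationMbPerSec = 622.9;   // avg creation rate, from the GC report
        double promotionMbPerSec = 18.39;  // avg promotion rate, from the GC report

        // Assumption: the young generation sits somewhere between G1's
        // default bounds of 5% and 60% of the heap.
        double youngMinMb = heapMb * 0.05;
        double youngMaxMb = heapMb * 0.60;

        System.out.printf(Locale.ROOT, "young gen fills every %.1f to %.1f s%n",
                secondsToFill(youngMinMb, creationMbPerSec),
                secondsToFill(youngMaxMb, creationMbPerSec));

        // Upper bound only: assumes nothing that was promoted is ever reclaimed.
        System.out.printf(Locale.ROOT, "a full heap's worth of promotions after %.0f s%n",
                secondsToFill(heapMb, promotionMbPerSec));
    }
}
```

If the young generation really is near the lower bound, young collections
would have to run about once per second just to keep pace, which leaves
little slack on a 4-CPU node that is already at 100%.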

Any ideas?

Best regards,


2017-10-09 12:44 GMT+01:00 Steinmaurer, Thomas 
<thomas.steinmau...@dynatrace.com>:
Hi,

Although this is not happening for us with Cassandra (where we use CMS), we 
had some weird problems with our own server application, e.g. it was hit by 
the following JVM/G1 bugs:
https://bugs.openjdk.java.net/browse/JDK-8140597
https://bugs.openjdk.java.net/browse/JDK-8141402 (more or less a duplicate of 
the above)
https://bugs.openjdk.java.net/browse/JDK-8048556

Especially the first, JDK-8140597, might be interesting if you see periodic 
humongous allocations (according to a GC log) that cause mixed GC phases to 
be steadily interrupted due to the G1 bug, so that no GC happens in OLD 
regions. Humongous allocations happen if a single (?) allocation is 
> (region size / 2), if I remember correctly. I can't recall the default G1 
region size for a 12GB heap, but it is possibly 4MB. So, in case you are 
allocating something larger than 2MB, you might end up with so-called 
“humongous” allocations, which are placed in one or more dedicated, 
contiguous G1 regions. If this happens very frequently within a very short 
time, then depending on your allocation rate in MB/s, the combination of the 
G1 bug and a small heap might lead towards an OOM.
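
The region-size guess above can be checked with a small sketch. It assumes
that JDK 8's G1 picks its default region size by aiming for roughly 2048
regions, rounding down to a power of two and clamping to the 1MB..32MB range;
that heuristic is recalled from HotSpot rather than stated in this thread, and
the class and method names are made up for illustration. Setting
-XX:G1HeapRegionSize explicitly bypasses the heuristic.

```java
class G1RegionSizeSketch {
    static final long MB = 1024L * 1024L;
    static final long TARGET_REGION_COUNT = 2048; // assumed G1 target
    static final long MIN_REGION = 1 * MB;        // assumed lower bound
    static final long MAX_REGION = 32 * MB;       // assumed upper bound

    /** Approximates G1's default region size for the given -Xms/-Xmx values. */
    static long defaultRegionSizeBytes(long initialHeapBytes, long maxHeapBytes) {
        long average = (initialHeapBytes + maxHeapBytes) / 2;
        long candidate = Math.max(average / TARGET_REGION_COUNT, MIN_REGION);
        long powerOfTwo = Long.highestOneBit(candidate); // round down to 2^n
        return Math.min(Math.max(powerOfTwo, MIN_REGION), MAX_REGION);
    }

    /** An allocation counts as humongous when it exceeds half a region. */
    static boolean isHumongous(long objectBytes, long regionBytes) {
        return objectBytes > regionBytes / 2;
    }

    public static void main(String[] args) {
        long heap = 12L * 1024 * MB; // 12 GB, with -Xms equal to -Xmx
        long region = defaultRegionSizeBytes(heap, heap);
        System.out.println("region size (MB): " + region / MB);
        System.out.println("humongous above (MB): " + region / 2 / MB);
        System.out.println("3 MB allocation humongous? " + isHumongous(3 * MB, region));
    }
}
```

Under these assumptions a 12GB heap gives 6MB per region before rounding,
which rounds down to 4MB, so the 2MB threshold mentioned above looks
plausible.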

Possibly a route worth investigating further.

Regards,
Thomas

From: Gustavo Scudeler [mailto:scudel...@gmail.com]
Sent: Monday, 09 October 2017 13:12
To: user@cassandra.apache.org
Subject: Cassandra and G1 Garbage collector stop the world event (STW)


Hi guys,

We have a 6-node Cassandra cluster under heavy utilization. We have been 
dealing a lot with garbage collector stop-the-world events, which can take up 
to 50 seconds on our nodes. In the meantime the Cassandra node is 
unresponsive, not even accepting new logins.

Extra details:
• Cassandra Version: 3.11
• Heap Size = 12 GB
• We are using G1 Garbage Collector with default settings
• Node size: 4 CPUs, 28 GB RAM
• All CPU cores are at 100% all the time.
• The G1 GC behavior is the same across all nodes.

The behavior is basically:
1.  Old Gen starts to fill up.
2.  GC can't clean it properly without a full GC and a STW event.
3.  The full GC starts to take longer, until the node is completely 
unresponsive.

Extra details and GC reports:
https://stackoverflow.com/questions/46568777/cassandra-and-g1-garbage-collector-stop-the-world-event-stw

Can someone point me to which configurations or events I could check?
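
One thing that can be checked without extra tooling is the per-collector
counters the JVM exposes through the standard java.lang.management API. The
sketch below only reports on the JVM it runs in; reading the same beans from
a Cassandra node would need a remote JMX connection, which is left out here.
The class name and the output format are made up for illustration.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.Locale;

class GcStatsSketch {
    /** Formats one collector's totals and its average pause per collection. */
    static String describe(String name, long count, long timeMs) {
        // A count of -1 means the collector does not report this value.
        double avgMs = count > 0 ? (double) timeMs / count : 0.0;
        return String.format(Locale.ROOT, "%s: collections=%d totalMs=%d avgMs=%.1f",
                name, count, timeMs, avgMs);
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(
                    describe(gc.getName(), gc.getCollectionCount(), gc.getCollectionTime()));
        }
    }
}
```

A steadily growing average for the old-generation collector would match step
3 of the behavior described above.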

Thanks!

Best regards,

The contents of this e-mail are intended for the named addressee only. It 
contains information that may be confidential. Unless you are the named 
addressee or an authorized designee, you may not copy or use it, or disclose it 
to anyone else. If you received it in error please notify us immediately and 
then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a 
company registered in Linz whose registered office is at 4040 Linz, Austria, 
Freistädterstraße 313





Re: Cassandra and G1 Garbage collector stop the world event (STW)

2017-10-09 Thread Chris Lohfink
Can you share your schema and cfstats? This sounds kind of like a wide
partition, backed-up compactions, or a tombstone issue, for it to create so
much garbage and run into issues like that so quickly with those settings.

A heap dump would be most telling but they are rather large and hard to
share.
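
As a first pass before a heap dump, the cfstats output can be scanned for
unusually large partitions. This is a minimal sketch that assumes the output
contains "Table: <name>" and "Compacted partition maximum bytes: <n>" lines;
that layout is recalled from 3.x and should be checked against your own
output. The 100 MB threshold and the sample lines are purely illustrative.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WidePartitionScan {
    // Assumed line formats of the cfstats output.
    private static final Pattern TABLE = Pattern.compile("^\\s*Table: (\\S+)");
    private static final Pattern MAX_BYTES =
            Pattern.compile("^\\s*Compacted partition maximum bytes: (\\d+)");

    /** Returns table name -> max partition size for tables above the threshold. */
    static Map<String, Long> tablesOverThreshold(List<String> lines, long thresholdBytes) {
        Map<String, Long> result = new LinkedHashMap<>();
        String currentTable = null;
        for (String line : lines) {
            Matcher table = TABLE.matcher(line);
            if (table.find()) {
                currentTable = table.group(1);
                continue;
            }
            Matcher max = MAX_BYTES.matcher(line);
            if (currentTable != null && max.find()) {
                long bytes = Long.parseLong(max.group(1));
                if (bytes > thresholdBytes) {
                    result.put(currentTable, bytes);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample, not taken from the cluster in this thread.
        List<String> sample = Arrays.asList(
                "\t\tTable: events",
                "\t\tCompacted partition maximum bytes: 268435456",
                "\t\tTable: users",
                "\t\tCompacted partition maximum bytes: 1024");
        long threshold = 100L * 1024 * 1024; // illustrative 100 MB cut-off
        tablesOverThreshold(sample, threshold)
                .forEach((table, bytes) -> System.out.println(table + " -> " + bytes + " bytes"));
    }
}
```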

Chris



Re: Cassandra and G1 Garbage collector stop the world event (STW)

2017-10-09 Thread Gustavo Scudeler
Hello,

@kurt greaves: Have you tried CMS with that sized heap?


Yes, for testing purposes, I have 3 nodes with CMS and 3 with G1. The
behavior is basically the same.

*Using CMS suggested settings*
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTAtNDk=

*Using G1 suggested settings*
http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTcvMTAvOC8tLWdjLmxvZy4wLmN1cnJlbnQtLTE5LTExLTE3


@Steinmaurer, Thomas: If this happens very frequently within a very short
time, then depending on your allocation rate in MB/s, the combination of the
G1 bug and a small heap might lead towards an OOM.


We have a really high object allocation rate:

Avg creation rate   622.9 MB/sec
Avg promotion rate  18.39 MB/sec

This could be the cause: the GC may simply be unable to keep up with this
rate.

I'm starting to think it could be a misconfiguration, where Cassandra is
set up in a way that bursts allocations in a manner that G1 can't keep up
with.

Any ideas?

Best regards,




RE: Cassandra and G1 Garbage collector stop the world event (STW)

2017-10-09 Thread Steinmaurer, Thomas
Hi,

Although this is not happening for us with Cassandra (where we use CMS), we 
had some weird problems with our own server application, e.g. it was hit by 
the following JVM/G1 bugs:
https://bugs.openjdk.java.net/browse/JDK-8140597
https://bugs.openjdk.java.net/browse/JDK-8141402 (more or less a duplicate of 
the above)
https://bugs.openjdk.java.net/browse/JDK-8048556

Especially the first, JDK-8140597, might be interesting if you see periodic 
humongous allocations (according to a GC log) that cause mixed GC phases to 
be steadily interrupted due to the G1 bug, so that no GC happens in OLD 
regions. Humongous allocations happen if a single (?) allocation is 
> (region size / 2), if I remember correctly. I can't recall the default G1 
region size for a 12GB heap, but it is possibly 4MB. So, in case you are 
allocating something larger than 2MB, you might end up with so-called 
“humongous” allocations, which are placed in one or more dedicated, 
contiguous G1 regions. If this happens very frequently within a very short 
time, then depending on your allocation rate in MB/s, the combination of the 
G1 bug and a small heap might lead towards an OOM.

Possibly a route worth investigating further.
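
A quick way to start on that route is to count how many pauses in a GC log
were triggered by humongous allocations. The sketch below assumes a JDK 8
style G1 log written with -XX:+PrintGCDetails, where the pause cause appears
as "G1 Humongous Allocation". The class name and the built-in sample lines
are hypothetical, and the sample is only used when no log path is passed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class HumongousLogScan {
    // GC cause text as printed in JDK 8 G1 logs.
    static final String HUMONGOUS_CAUSE = "G1 Humongous Allocation";

    /** Counts the log lines that contain the given marker text. */
    static long countContaining(List<String> lines, String marker) {
        return lines.stream().filter(line -> line.contains(marker)).count();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample lines, not taken from the logs in this thread.
        List<String> lines = Arrays.asList(
                "[GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0211 secs]",
                "[GC pause (G1 Evacuation Pause) (mixed), 0.0150 secs]",
                "[Full GC (Allocation Failure)  11G->9G(12G), 48.2 secs]");
        if (args.length > 0) {
            lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        }
        System.out.println("pauses caused by humongous allocations: "
                + countContaining(lines, HUMONGOUS_CAUSE));
        System.out.println("full GCs: " + countContaining(lines, "Full GC"));
    }
}
```

Many humongous-triggered pauses combined with few or no mixed collections
would be consistent with the JDK-8140597 pattern described above.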

Regards,
Thomas

From: Gustavo Scudeler [mailto:scudel...@gmail.com]
Sent: Monday, 09 October 2017 13:12
To: user@cassandra.apache.org
Subject: Cassandra and G1 Garbage collector stop the world event (STW)


Hi guys,

We have a 6-node Cassandra cluster under heavy utilization. We have been 
dealing a lot with garbage collector stop-the-world events, which can take up 
to 50 seconds on our nodes. In the meantime the Cassandra node is 
unresponsive, not even accepting new logins.

Extra details:
· Cassandra Version: 3.11
· Heap Size = 12 GB
· We are using G1 Garbage Collector with default settings
· Node size: 4 CPUs, 28 GB RAM
· All CPU cores are at 100% all the time.
· The G1 GC behavior is the same across all nodes.

The behavior is basically:
1.  Old Gen starts to fill up.
2.  GC can't clean it properly without a full GC and a STW event.
3.  The full GC starts to take longer, until the node is completely 
unresponsive.

Extra details and GC reports:
https://stackoverflow.com/questions/46568777/cassandra-and-g1-garbage-collector-stop-the-world-event-stw

Can someone point me to which configurations or events I could check?

Thanks!

Best regards,



Re: Cassandra and G1 Garbage collector stop the world event (STW)

2017-10-09 Thread kurt greaves
Have you tried CMS with that sized heap? G1 is only really worthwhile with
a 24GB+ heap, which wouldn't really make sense on machines with 28GB of
RAM. In general, CMS is found to work better for C*, leaving excess memory
to be utilised by the OS page cache.
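
The memory trade-off behind that advice is simple arithmetic, sketched below
with the numbers from this thread. It is only an upper bound: Cassandra's
off-heap structures and the OS itself also take a share of what remains. The
class and method names are made up for illustration.

```java
class HeapHeadroomSketch {
    /** RAM left outside the Java heap, in GB, before any off-heap usage. */
    static long remainingGb(long ramGb, long heapGb) {
        if (heapGb > ramGb) {
            throw new IllegalArgumentException("heap larger than RAM");
        }
        return ramGb - heapGb;
    }

    public static void main(String[] args) {
        long ramGb = 28; // node size from the thread
        for (long heapGb : new long[] {12, 24}) {
            System.out.println(heapGb + " GB heap leaves at most "
                    + remainingGb(ramGb, heapGb) + " GB for page cache and off-heap data");
        }
    }
}
```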