[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14559669#comment-14559669 ]

Albert P Tobey commented on CASSANDRA-7486:
---
https://github.com/tobert/cassandra/tree/g1gc
https://github.com/tobert/cassandra/commit/33bf6719e0c8e84672c3633f8ecce602affc3071
https://github.com/tobert/cassandra/commit/cafee86c3c5798e423689a26b43d05ed9312adc5

Compare CMS and G1 pause times
--
Key: CASSANDRA-7486
URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
Project: Cassandra
Issue Type: Test
Components: Config
Reporter: Jonathan Ellis
Assignee: Albert P Tobey
Fix For: 3.0 beta 1

See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281

May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553067#comment-14553067 ]

Albert P Tobey commented on CASSANDRA-7486:
---
I'll attach a patch ASAP.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550741#comment-14550741 ]

Albert P Tobey commented on CASSANDRA-7486:
---
Did you run into evacuation failures? How big was your heap?

I haven't seen any evac failures with 2.1 and Java 8. This is one of the things that was worked on for HotSpot 1.8. Then again, maybe it's Solr that needs the help.

I suspect you can remove a lot of these settings on Java 8, but I have also discovered that setting the GC threads is necessary on many machines.

Try adding the line below for a nice decrease in p99 latencies.

JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"

Compare CMS and G1 pause times
--
Key: CASSANDRA-7486
URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
Project: Cassandra
Issue Type: Test
Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
Fix For: 2.1.x

See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281

May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
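When options such as the one suggested in the comment above get added and removed during tuning, it is easy to end up with the same flag twice. The following is a minimal sketch of an idempotent append for a `cassandra-env.sh` style script. The helper name `add_jvm_opt` is hypothetical and not part of Cassandra; only the two `-XX` flags come from this thread.

```shell
#!/bin/sh
# Hypothetical helper (not part of Cassandra): append a JVM option to
# JVM_OPTS only if that exact option is not already present.
add_jvm_opt() {
    case " $JVM_OPTS " in
        *" $1 "*) ;;                                  # already present
        *) JVM_OPTS="${JVM_OPTS:+$JVM_OPTS }$1" ;;    # append, space-separated
    esac
}

JVM_OPTS="${JVM_OPTS:-}"
add_jvm_opt "-XX:+UseG1GC"
# The flag suggested in the comment above.
add_jvm_opt "-XX:G1RSetUpdatingPauseTimePercent=5"
```

Calling `add_jvm_opt` a second time with the same flag leaves `JVM_OPTS` unchanged, so the fragment can be sourced from more than one place without duplicating options.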
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14550721#comment-14550721 ]

Michael Perrone commented on CASSANDRA-7486:

I have done extensive load testing of G1GC with Java 1.7.0_80 and Cassandra 2.0.12.x versions, with Solr secondary indexes and a 20GB max heap. On 8-core systems these options were the sweet spot for the test workload and worked out well in a production cluster, providing dramatic improvements in overall GC time and eliminating long CMS pauses that we could not tune out. I will try to attach some graphs/tables/metrics in another comment.

{code:yaml}
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"

# set these to the number of cores
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"

JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"

# default is 10, this makes G1 slightly more aggressive
# by starting the marking cycle earlier
# in order to avoid evacuation failure (OOM)
JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=15"

# default is 45, we should start sooner.
# in a high write large heap (20GB) this was
# found to eliminate Old gen pauses
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"

# default is 200, there is a tradeoff between latency
# and throughput. increase this to increase throughput
# at the cost of potential latency, up to 1000
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"

# use largest possible region size
# speeds up marking phase, tradeoff is efficiency
# comment out to let JVM decide the size
# (the value is a size in bytes unless suffixed, so 32m selects 32MB regions)
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32m"
{code}
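The options above hard-code the GC thread counts for 8-core machines, with the note to set them to the number of cores. A minimal sketch of deriving them at startup instead follows. The function name `gc_thread_opts` is hypothetical, and matching both counts to the core count simply mirrors the comment above; it is not a general recommendation.

```shell
#!/bin/sh
# Hypothetical helper: build the two GC thread flags from a core count.
# With no argument it asks the OS for the number of online processors and
# falls back to 1 if that query is unavailable.
gc_thread_opts() {
    cores="$1"
    if [ -z "$cores" ]; then
        cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
    fi
    echo "-XX:ParallelGCThreads=$cores -XX:ConcGCThreads=$cores"
}

JVM_OPTS="${JVM_OPTS:+$JVM_OPTS }$(gc_thread_opts)"
```

Passing an explicit count, for example `gc_thread_opts 8`, reproduces the values used in the comment above.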
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551202#comment-14551202 ]

Michael Perrone commented on CASSANDRA-7486:

I have not seen evacuation failures, but the test systems ran tight enough under heavy load that I increased the reserve percentage. Heap was 20GB.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524278#comment-14524278 ]

Matt Stump commented on CASSANDRA-7486:
---
Before we talk about changing the defaults I would like to see tests run on something more representative of customer hardware. At the very least we should be doing comparisons of CMS vs G1 for different workloads on cstar. I did some initial testing and didn't see a huge benefit, but I very well could have been doing something wrong. I'm both hopeful and skeptical.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523593#comment-14523593 ]

Albert P Tobey commented on CASSANDRA-7486:
---
bq. if I am reading correctly there was pretty much never an old generation collection under the workload I looked at. The old gen was growing but never reached the point it needed to do an old gen GC.

^ G1 doesn't work that way.

bq. Another behavior to consider is worst case pause time when there is fragmentation.

^ G1 performs compaction. It's fairly easy to trigger and observe in gc.log with Cassandra 2.0. It takes more work with 2.1 since it seems to be easier on the GC.

I'll see if I can find some time to generate graphs to make all this more convincing, but time is short because I'm spending all of my time tuning users' clusters, where the #1 issue every time is getting CMS to behave.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523630#comment-14523630 ]

Ariel Weisberg commented on CASSANDRA-7486:
---
bq. ^ G1 doesn't work that way.

I am talking about CMS. When I looked at the 12 gigabyte heap, the old gen grew to 4.1 gigabytes and I didn't see any point at which it shrank.

bq. I'm spending all of my time tuning users' clusters where the #1 first issue every time is getting CMS to behave.

We can make the case for G1 in different ways. If we want to do it based on real-world results, that is fine with me. To Benedict's point, I think looking at all the operations we care about on realistic time scales is something we would have to do to really know what the differences are.

I wish we had this stuff in CI so it would just be a matter of changing the flags, but we aren't there yet.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523724#comment-14523724 ]

Albert P Tobey commented on CASSANDRA-7486:
---
I only started messing with G1 this year, so I only know the old behavior by lore I've read and heard. I have not observed significant problems with it in the ~20-40 hours I've spent tuning clusters with G1 recently.

I don't recommend anyone try G1 on anything older than JDK 7u75 or JDK 8u40 (although it's probably OK down to u20 according to the docs I've read). I did some testing on JDK 7u75 and it was stable, but I didn't spend much time on it since JDK 8u40 gave a nice bump in performance (5-10% on a customer cluster) by just switching JDKs and nothing else.

From what I've read about the reference clearing issues, there is a new-ish setting to enable parallel reference collection, -XX:+ParallelRefProcEnabled. The advice in the docs is to only turn it on if a significant amount of time is spent on RefProc collection, e.g. [Ref Proc: 5.2 ms]. I pulled that from a log I had handy, and that is high enough that we might want to consider enabling the flag, but in most of my observations it hovers around 0.1ms under saturation load.
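The check described in the comment above, looking at how much time goes to reference processing before enabling -XX:+ParallelRefProcEnabled, can be sketched as a small log filter. This assumes the GC log contains entries shaped exactly like the `[Ref Proc: 5.2 ms]` sample quoted above; the function name `max_ref_proc_ms` and the file name `gc.log` in the usage note are illustrative.

```shell
#!/bin/sh
# Hypothetical helper: read a GC log on stdin and print the largest
# "[Ref Proc: N ms]" value seen, with one decimal place.
# Prints nothing if the log has no such entries.
max_ref_proc_ms() {
    awk '
        match($0, /\[Ref Proc: [0-9.]+ ms\]/) {
            # "[Ref Proc: " is 11 characters and " ms]" is 4, so the
            # number sits between them inside the matched text.
            val = substr($0, RSTART + 11, RLENGTH - 15) + 0
            if (!seen || val > max) max = val
            seen = 1
        }
        END { if (seen) printf "%.1f\n", max }'
}
```

Usage would look like `max_ref_proc_ms < gc.log`. Whether a given maximum counts as "significant" is a judgment call that this sketch does not make.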
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523736#comment-14523736 ]

Benedict commented on CASSANDRA-7486:
-
bq. there is a new-ish setting to enable parallel reference collection

Throughput was something like 5% of other collectors in my testing, so parallelizing this would only help so much! :)

My point is, we don't really fully understand G1, and unless we undertake a research project to fully understand its pathological cases, and how they compare/contrast to its history, I'd prefer we ensured it behaved under complex loads, and not just under isolated read or write loading.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523621#comment-14523621 ]

Benedict commented on CASSANDRA-7486:
-
bq. ^ G1 doesn't work that way.

While it has no old generation, it does promote regions, and if this happens a lot you can get some weird pathological fragmentation.

Now, my experience with G1 is out of date, and I haven't kept up at all with its latest behaviours, but I saw some really atrocious behaviour on very simple benchmarks a few years back. At the time, if you modified references that were randomly distributed around the heap, it required traversing a majority of the heap to collect very little, and essentially thrashed. I realise it has improved, but I do not know in what ways, and so I'm wary of making it the default without being certain it no longer has pathological cases that are a problem for us. Unless we stress the collector so that it exercises its suboptimal characteristics, I am not really super confident.

I hope this is simply overly cautious, but we know of users who also had serious problems with sudden degradation despite looking good in initial testing, and it would be great for that not to be a widespread problem.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14523146#comment-14523146 ]

Benedict commented on CASSANDRA-7486:
-
bq. if I am reading correctly there was pretty much never an old generation collection under the workload I looked at. The old gen was growing but never reached the point it needed to do an old gen GC.

bq. Another behavior to consider is worst case pause time when there is fragmentation.

These are concerns we should not dismiss out of hand. My concern is that these benchmarks, run in an idealised world of a steady rate of work production, are not representative of a workload that includes repair, validation, long-running huge compactions, hinting, and periodic read/write load spikes. If these performance profiles depend on the memtables never being promoted, they depend on the disk keeping up, and under a worse but realistic workload the characteristics may be nothing like what [~ato...@datastax.com] is seeing.

Changing these defaults should be done with the absolute utmost of care, and I would like to see a lot of very long-running mixed workload tests including all of the other spanners in the works.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521805#comment-14521805 ]

Ariel Weisberg commented on CASSANDRA-7486:
---
I would just like to see the data visualized. If it's not better in every dimension, then in which dimensions is it better or worse across all the data that Al collected?
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520926#comment-14520926 ]

Phil Yang commented on CASSANDRA-7486:
--
Another small question :)

Since many performance improvements were made to G1 in JDK 8 and its update releases, should we recommend that JDK 7 users, especially those on early releases, upgrade to the latest JDK 8?

I also found a JEP for JDK 9 that proposes to "make G1 the default garbage collector on 32- and 64-bit server configurations". See http://openjdk.java.net/jeps/248 if you have not heard about it.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520071#comment-14520071 ]

Jonathan Ellis commented on CASSANDRA-7486:
---
bq. IMO consistent performance should always take precedence over maximum performance/throughput.

Agreed. I think our bar here should be "Is G1 better for the average user?", keeping in mind that the average user is a *lot* worse at tuning CMS than Ariel. Power users can tune for their own workload the way they always have.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519791#comment-14519791 ]

Albert P Tobey commented on CASSANDRA-7486:
---
[~aweisberg] comparing promotions between G1 and CMS doesn't really make sense IMO. G1 promotions simply mark memory where it is without copying. After a threshold it will compact surviving cards into a single region. What I've observed is that compaction is rarely necessary with a big enough G1 heap. With a saturation write workload on Cassandra 2.1, only ~100-200MB seems to stick around for the long haul, with almost all the rest getting cycled every few minutes (in an 8GB heap).

[~yangzhe1991] I would keep the default heap at 8GB. I have tested with G1 at 16GB on a 30GB m3.2xlarge on EC2 and it generally gets better throughput and latency because there's more space for G1 to waste (that's what they call it). Intel tested up to 100GB with HBase at a 200ms pause target and said nice things about it. I don't see much need for C* to hit that size, but it's certainly doable with G1. The main problem is smaller heaps, where G1 starts to struggle a little, but I found that it still works OK down to 512MB, even if it is a bit less efficient than CMS, since G1 targets ~10% CPU time for GC while the others target 1% by default.

Throughput / latency is always a tradeoff, and in the case of G1 with non-aggressive latency targets (-XX:MaxGCPauseMillis=2000) the throughput is darn close to CMS with considerably improved standard deviation on latency. IMO that's a great tradeoff, as most of the users I talk to in the wild struggle with getting reliable latency rather than throughput.

IMO consistent performance should always take precedence over maximum performance/throughput. G1 provides a much more consistent experience with fewer knobs to mess with (especially tuning eden size, which is still a black art that nearly every installation I've looked at gets wrong).
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519456#comment-14519456 ]

Ariel Weisberg commented on CASSANDRA-7486:
---
For the C* we ship today we should evaluate whether G1 is better. For the platonic ideal C* (where the heap only needs to be 1 or 2 gig) I suspect we should ship CMS, because I found it has lower baseline pause times for young gen collections, especially on server class hardware.

This is something that we can do in a data-driven way. [~ato...@datastax.com] got some good data, but when I sampled throughput (from the spreadsheet) on some workloads like the 12g CMS and G1, I saw more throughput under CMS. I think we should munge the data a bit and visualize throughput and P99 (or P99.9). I am also not a fan of basing the decision on that number of cores and a non-NUMA machine, which is not representative of the hardware people use.

I am not comfortable with the measurements for large heaps because, if I am reading correctly, there was pretty much never an old generation collection under the workload I looked at. The old gen was growing but never reached the point where it needed to do an old gen GC. It's great the server can run that long with so little promotion (TIL that is a thing that happens). That explains the very long young gen pauses: lots of survivor copying, I guess, when I look at the size of the survivor set vs pause time. I saw young gen pauses in the 400+ millisecond range under both collectors.

Another behavior to consider is worst case pause time when there is fragmentation. With all the overhead of survivor copying, I start to wonder if a valid strategy would be to allow promotion and let the concurrent collector run all the time. That would bring down young gen GC pauses in exchange for throughput.

I think whether 8099 means no off-heap memtables in 3.0 is also a factor. If G1 scales to larger heaps and larger on-heap memtables then it will be a better choice.
Compare CMS and G1 pause times
--
Key: CASSANDRA-7486
URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
Project: Cassandra
Issue Type: Test
Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
Fix For: 2.1.5

See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281

May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
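The comment above asks for P99 (or P99.9) figures alongside throughput. As a rough illustration of what a P99 over a set of pause times means, here is a minimal sketch using the nearest-rank method. The function name `percentile` and the input file in the usage note are hypothetical, the input is assumed to be one number per line, and the percentile is assumed to be a whole number (so P99.9 is outside what this sketch handles).

```shell
#!/bin/sh
# Hypothetical helper: nearest-rank percentile of numbers read from stdin.
# $1 is an integer percentile between 1 and 100.
percentile() {
    sort -n | awk -v p="$1" '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            # ceil(p * NR / 100) using integer arithmetic
            idx = int((p * NR + 99) / 100)
            if (idx < 1) idx = 1
            print v[idx]
        }'
}
```

With pause durations extracted into a file, `percentile 99 < pauses_ms.txt` would give the value that 99% of pauses do not exceed. Comparing that number between CMS and G1 runs of the same workload is one way to frame the comparison discussed here.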
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14519604#comment-14519604 ]

Phil Yang commented on CASSANDRA-7486:
--
I think changing the default calls for prudence and care. For C* users it is usually acceptable if a new version performs no better than the old one. However, it may be unacceptable if the new version performs worse. If there is a risk that G1 is worse than CMS in some cases, it may be better to make G1 optional first, by offering another conf/cassandra-env-g1.sh file to let people try it, without changing the default settings.

For the tests comparing G1 and CMS, do the tests cover some extreme cases? For example: bootstrap/rebuild/remove node, repair, lots of queries over tombstone_failure_threshold... And I think each test should run for at least 24 hours, to include several full GCs when estimating the latency.

Furthermore, with CMS we currently have a max heap size limit (8GB) even if the node has a lot of memory. If we decide to change the default GC algorithm, what is a suitable new limit?

Compare CMS and G1 pause times
--
Key: CASSANDRA-7486
URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
Project: Cassandra
Issue Type: Test
Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
Fix For: 2.1.6

See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281

May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.)

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
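One way to offer G1 as an opt-in without changing the default, in the spirit of the comment above, is a single switch in the environment script rather than a second file. The sketch below is only an illustration: the variable `CASSANDRA_GC` and the function `gc_opts` are hypothetical and do not exist in Cassandra, and the flag sets are deliberately minimal (just the collector selection flags, no tuning).

```shell
#!/bin/sh
# Hypothetical opt-in switch: CMS stays the default, G1 is selected only
# when explicitly requested.
gc_opts() {
    case "${1:-cms}" in
        g1)  echo "-XX:+UseG1GC" ;;
        cms) echo "-XX:+UseParNewGC -XX:+UseConcMarkSweepGC" ;;
        *)   echo "unknown collector: $1" >&2; return 1 ;;
    esac
}

JVM_OPTS="${JVM_OPTS:+$JVM_OPTS }$(gc_opts "${CASSANDRA_GC:-cms}")"
```

An operator would then start a node with `CASSANDRA_GC=g1` set to try G1, while everyone else keeps the existing behaviour.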
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518707#comment-14518707 ]

Rick Branson commented on CASSANDRA-7486:
-
I think it definitely makes sense as a default. My guess is that it'll result in fewer headaches for most people.
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518704#comment-14518704 ]

Jonathan Ellis commented on CASSANDRA-7486:
---
Sounds like G1 is finally ready to replace CMS. WDYT [~mstump] [~rbranson] [~aweisberg]?
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497603#comment-14497603 ] Albert P Tobey commented on CASSANDRA-7486: --- My benchmarks completed. These were run on 6 quad-core Intel NUCs with 16GB RAM / 240GB SSD / gigabit ethernet. Cassandra 2.1.4. The CPUs are fairly slow at 1.4GHz. The tests were automated with a complete cluster rebuild between tests and caches dropped before starting Cassandra each time. The big win with G1 IMO is that it is auto-tuning. I've been running it on a few other kinds of machines and it generally does much better with more CPU power. cassandra-stress was run with an increased heap but is otherwise unmodified from Cassandra 2.1.4. I checked the GC log regularly and did not see many pauses above 1ms for stress itself, with most pauses in the ~300usec range. The final stress output is available here: https://docs.google.com/a/datastax.com/spreadsheets/d/19Eb7HGkd5rFUD_C0ZALbK6-R4fPF9vJRr8BrvxBwo38/edit?usp=sharing http://tobert.org/downloads/cassandra-2.1-cms-vs-g1.csv The stress commands, system.log, GC logs, conf directory from all the servers, and full stress logs are available on my webserver here: http://tobert.org/downloads/cassandra-2.1-cms-vs-g1-data.tar.gz (35MB) Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.5 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485579#comment-14485579 ] Albert P Tobey commented on CASSANDRA-7486: --- This is with 2.0 / OpenJDK8 since that's what I had running. Same everything each run except for heap size. cassandra-stress 2.1.4 read workload / 800 threads. I'll re-run with 2.1 / Oracle JDK8 and some mixed load.

-XX:+UseG1GC
Also: -XX:+UseTLAB -XX:+ResizeTLAB -XX:-UseBiasedLocking -XX:+AlwaysPreTouch but maybe those should go in a different ticket.

                              8GB       512MB     256MB
op rate                     : 139805    114214    60028
partition rate              : 139805    114214    60028
row rate                    : 139805    114214    60028
latency mean                : 5.7       7.0       13.3
latency median              : 4.2       3.7       4.0
latency 95th percentile     : 13.2      12.4      44.7
latency 99th percentile     : 18.5      14.7      73.5
latency 99.9th percentile   : 21.1      15.3      79.6
latency max                 : 303.8     307.1     1105.4

Same everything with mostly stock CMS settings for 2.0. I added the -XX:+UseTLAB -XX:+ResizeTLAB -XX:-UseBiasedLocking -XX:+AlwaysPreTouch settings to keep the numbers comparable to all of my other data.

                              8GB/1GB   512MB (*) 256MB (*)
op rate                     : 119155    82375     77705
partition rate              : 119155    82375     77705
row rate                    : 119155    82375     77705
latency mean                : 6.7       9.7       10.3
latency median              : 4.1       4.3       4.8
latency 95th percentile     : 11.8      28.2      33.6
latency 99th percentile     : 15.5      49.4      45.3
latency 99.9th percentile   : 17.3      54.8      49.1
latency max                 : 520.2     2642.6    1990.0

(*) run with -XX:+UseAdaptiveSizePolicy

Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.5 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
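For a quick read of the numbers in the comment above, the op rates can be compared pairwise. This is a minimal Python sketch and not part of the original benchmark tooling; the pairing of the 8GB G1 run with the 8GB/1GB CMS run, and the names `OP_RATES` and `g1_gain_percent`, are assumptions made for illustration.

```python
# Op rates copied from the cassandra-stress results quoted above.
# Keys are heap sizes; the CMS "8GB" entry is the 8GB/1GB run (assumed pairing).
OP_RATES = {
    "8GB": {"g1": 139805, "cms": 119155},
    "512MB": {"g1": 114214, "cms": 82375},
    "256MB": {"g1": 60028, "cms": 77705},
}


def g1_gain_percent(g1_rate: float, cms_rate: float) -> float:
    """Return G1 throughput relative to CMS as a signed percentage."""
    return (g1_rate / cms_rate - 1.0) * 100.0


if __name__ == "__main__":
    for heap, rates in OP_RATES.items():
        gain = g1_gain_percent(rates["g1"], rates["cms"])
        print(f"{heap:>6}: G1 vs CMS op rate {gain:+.1f}%")
```

On these figures G1 is ahead at 8GB and 512MB and behind at 256MB. Keep in mind this is a single read workload on 2.0 / OpenJDK8, so the percentages are indicative only.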
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484318#comment-14484318 ] Albert P Tobey commented on CASSANDRA-7486: --- So far my testing of read workloads matches my experience with writes. An 8GB heap with generic G1GC settings is good for more workloads out of the box than haphazardly tuned CMS can be. I've been testing on a mix of Oracle/OpenJDK and JDK7/8 and the results are fairly consistent across the board with the exception that performance is a tad higher (~5%) on JDK8 than JDK7 (with G1GC - I have not tested CMS much on JDK8). These parameters get better throughput than CMS out of the box with significantly improved consistency in the max and p99.9 latency. -Xmx8G -Xms8G -XX:+UseG1GC If throughput is more critical than latency, the following will get a few % more throughput at the cost of potentially higher max pause times: -Xmx8G -Xms8G -XX:+UseG1GC -XX:MaxGCPauseMillis=2000 -XX:InitiatingHeapOccupancyPercent=75 My recommendation is to document the last two options in cassandra-env.sh but leave them disabled/commented out for end-users to fiddle with. Other knobs for G1 didn't make a statistically measurable difference in my observations. G1 scales particularly well with heap size on huge machines. 8 to 16GB doesn't seem to make a big difference, matching what [~rbranson] saw. At 24GB I started seeing about 8-10% throughput increase with little variance in pause times. IMO the simple G1 configuration should be the default for large heaps. It's simple and provides consistent latency. Because it uses heuristics to determine the eden size and scanning schedule, it will adapt well to diverse environments without tweaking. Heap sizes under 8GB should continue to use CMS or even experiment with serial collectors (e.g. Raspberry Pi, t2.micro, vagrant). 
If there is interest, I will write up a patch for cassandra-env.sh to make the auto-detection code pick G1GC at >= 6GB heap and CMS for < 6GB. Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.5 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
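The collector auto-detection proposed in the comment above could look roughly like the following. This is a hedged sketch in Python rather than the actual cassandra-env.sh patch (which would be shell); it assumes the intended rule is G1 for heaps of 6GB and up and CMS below that. The function name `gc_flags`, the constant `G1_MIN_HEAP_MB`, and the `favor_throughput` switch are hypothetical, and only the collector-selection flags are shown for CMS, not the stock CMS tuning.

```python
from typing import List

# Assumed threshold: G1 for heaps of 6GB and up, CMS below that.
G1_MIN_HEAP_MB = 6 * 1024


def gc_flags(heap_mb: int, favor_throughput: bool = False) -> List[str]:
    """Build JVM flags for a fixed-size heap, choosing the collector by heap size."""
    flags = [f"-Xms{heap_mb}M", f"-Xmx{heap_mb}M"]
    if heap_mb >= G1_MIN_HEAP_MB:
        flags.append("-XX:+UseG1GC")
        if favor_throughput:
            # The two optional knobs from the comment: more throughput at the
            # cost of potentially higher max pause times.
            flags.append("-XX:MaxGCPauseMillis=2000")
            flags.append("-XX:InitiatingHeapOccupancyPercent=75")
    else:
        # Collector selection only; the stock CMS tuning flags are omitted.
        flags.append("-XX:+UseParNewGC")
        flags.append("-XX:+UseConcMarkSweepGC")
    return flags


if __name__ == "__main__":
    print(" ".join(gc_flags(8192)))
    print(" ".join(gc_flags(8192, favor_throughput=True)))
    print(" ".join(gc_flags(4096)))
```

Keeping the throughput knobs behind an explicit switch mirrors the recommendation to document them but leave them disabled by default.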
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484645#comment-14484645 ] Jonathan Ellis commented on CASSANDRA-7486: --- [~ato...@datastax.com] wdyt of mstump's suggestions at CASSANDRA-8150? Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.5 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484647#comment-14484647 ] Jonathan Ellis commented on CASSANDRA-7486: --- Also, how much better is CMS for small heaps? Given that sub-8GB heaps aren't particularly common or recommended, can we just simplify it to use G1? Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.5 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484657#comment-14484657 ] Albert P Tobey commented on CASSANDRA-7486: --- I'll kick off some tests and find out. All of the Oracle docs say not to bother below 6GB, but yeah I agree, if it's basically not bad we should go with simple. Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.5 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390359#comment-14390359 ] Robert Stupp commented on CASSANDRA-7486: - Nice :) Looking forward to seeing Oracle JVM and C* 2.1 results. (TBH, I don't expect much difference between OpenJDK8 and Oracle JDK8.) But more interesting would be how G1 behaves w/ read and mixed workloads. Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.4 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389850#comment-14389850 ] Albert P Tobey commented on CASSANDRA-7486: --- I managed to get G1 (Java 8) to beat CMS on both latency and throughput on my NUC cluster. Preliminary results: https://gist.github.com/tobert/ea9328e4873441c7fc34 Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.4 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353170#comment-14353170 ] Jeremy Hanna commented on CASSANDRA-7486: - Any update on this testing [~shawn.kumar]? Just wondered as this ticket seemed promising initially but hasn't been updated in some time. Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Shawn Kumar Fix For: 2.1.4 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050677#comment-14050677 ] Rick Branson commented on CASSANDRA-7486: - Mad anecdotes: We ran with G1 enabled for around 4 days in a 33-node cluster running 1.2.17 on JDK7u45 that has around a 1:5 read:write ratio. We tried a few different configurations with short durations, but most of the time we ran it with the out-of-the-box G1 configuration on a 20G heap and 32 parallel GC threads (16 core, 32 hyperthreaded). There were some somewhat scary bugs fixed in 7u60 that ultimately caused me to roll back to the CMS collector after the experiment.

* The experiment pointed out that our young gen was basically too small and was pulling latency up significantly. When we returned to CMS, I doubled new size from 800M to 1600M. We had moved to new hardware and hadn't taken the time to sit down and play with GC settings. This cut our mean latency dramatically as perceived from the client, ~50% for writes and ~30% for reads, similar to what we saw with G1. I was quite thrilled with this result.
* I tried both 100ms and 150ms pause time targets with 12G, 16G, and 20G heaps, and while these resulted in slightly lower mean latency (~5-10%), Mixed GC activity caused P99s to suffer greatly. There's compelling evidence that the 200ms default is nearly ideal for the way the G1 algorithm works in its current incarnation.
* We basically needed a 20G heap to make G1 work well for us, since by default G1 will use up to half of the max heap for eden space and Cassandra needs quite a large old gen to stay happy. G1 appears to need a much larger eden space to work efficiently, sizes that would make ParNew die in a fire. GCs of the eden space were impressively fast, with a ~10G eden space taking ~120ms on average to collect.
* G1's huge eden space was helpful in working around some issues with compaction on the hints CF, which had dozens of very wide partitions with hundreds of thousands of cells each.
* Overall, at the default 200ms pause time target, we didn't see much of an increase in CPU usage over CMS.

In the end, my tests basically told us that G1 requires a larger heap to get the same results with *far* less tuning. If there are GC issues, it seems like in the vast majority of cases G1 can either eliminate them or make it easy to just work around them by cranking up the heap size. Someone should probably test G1 with a variable-sized heap since it's designed to give back RAM when it thinks it doesn't need it. That might or might not actually work. While we didn't test this, a configuration of G1 + heap size min of 1/8 RAM and max of 1/2 RAM might make a really nice default for Cassandra at some point.

Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Ryan McGuire Fix For: 2.1.0 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.2#6252)
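The untested default floated at the end of the comment above (G1 with a minimum heap of 1/8 of RAM and a maximum of 1/2 of RAM) works out as follows. This is an illustrative Python sketch only; the helper names `heap_bounds_mb` and `g1_variable_heap_flags` are hypothetical, and whether G1 actually returns memory in such a setup is exactly the open question raised in the comment.

```python
from typing import List, Tuple


def heap_bounds_mb(ram_mb: int) -> Tuple[int, int]:
    """Return (min_heap, max_heap) in MB as 1/8 and 1/2 of physical RAM."""
    return ram_mb // 8, ram_mb // 2


def g1_variable_heap_flags(ram_mb: int) -> List[str]:
    """JVM flags for a variable-sized G1 heap bounded by heap_bounds_mb()."""
    min_heap, max_heap = heap_bounds_mb(ram_mb)
    return [f"-Xms{min_heap}M", f"-Xmx{max_heap}M", "-XX:+UseG1GC"]


if __name__ == "__main__":
    # Example: a machine with 64GB of RAM.
    print(" ".join(g1_variable_heap_flags(64 * 1024)))
```

For a 64GB machine this gives an 8GB minimum and a 32GB maximum heap.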
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049127#comment-14049127 ] Jonathan Ellis commented on CASSANDRA-7486: --- /cc [~rbranson] [~benedict] Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Ryan McGuire Fix For: 2.1.0 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal with 2.0 with moving most of memtables off heap. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049219#comment-14049219 ] T Jake Luciani commented on CASSANDRA-7486: --- See also https://software.intel.com/en-us/blogs/2014/06/18/part-1-tuning-java-garbage-collection-for-hbase 100gb heaps! Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Ryan McGuire Fix For: 2.1.0 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times
[ https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049217#comment-14049217 ] Benedict commented on CASSANDRA-7486: - We need to make sure we test over an extended period with a variety of operations being exercised against the cluster. This is probably a good opportunity to try and define a real world burn in test as well, and what parameters should be included. Some things to consider:

# Range of data distributions, including (esp. for this) large partitions and very large cells. Possibly run two or three parallel stress profiles with very different data profiles to really give GC a headache dealing with different velocities / lifetimes.
# Incremental and full repairs
# Hint accumulation / node death
# Tombstones / Range Tombstones
# Secondary indexes?

I'd suggest ignoring some variables, and e.g. stick with just netty, so we can define a single complex workload and run it for an extended period and get a good result. While our client buffers behave quite differently with each, I'm happy tuning defaults for native now it's faster. It might also be useful, for this test only, to see for a single node how well the two degrade as heap pressure increases, by artificially consuming large portions of the heap for the duration of a more simple stress test.

Compare CMS and G1 pause times -- Key: CASSANDRA-7486 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486 Project: Cassandra Issue Type: Test Components: Config Reporter: Jonathan Ellis Assignee: Ryan McGuire Fix For: 2.1.0 See http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning and https://twitter.com/rbranson/status/482113561431265281 May want to default 2.1 to G1. 2.1 is a different animal from 2.0 after moving most of memtables off heap. Suspect this will help G1 even more than CMS. (NB this is off by default but needs to be part of the test.) 
-- This message was sent by Atlassian JIRA (v6.2#6252)