[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-26 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14559669#comment-14559669
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

https://github.com/tobert/cassandra/tree/g1gc

https://github.com/tobert/cassandra/commit/33bf6719e0c8e84672c3633f8ecce602affc3071
https://github.com/tobert/cassandra/commit/cafee86c3c5798e423689a26b43d05ed9312adc5

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Albert P Tobey
 Fix For: 3.0 beta 1


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-20 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14553067#comment-14553067
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

I'll attach a patch ASAP.



[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-19 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550741#comment-14550741
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

Did you run into evacuation failures? How big was your heap? I haven't seen any 
evac failures with 2.1 and Java 8. This is one of the things that was worked on 
for Hotspot 1.8. Then again maybe it's Solr that needs the help.

I suspect you can remove a lot of these settings on Java 8, but have also 
discovered that setting the GC threads is necessary on many machines.

Try adding the below line for a nice decrease in p99 latencies.

JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
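
A minimal sketch of setting the GC thread counts explicitly, as mentioned 
above. It assumes a Linux host where `getconf _NPROCESSORS_ONLN` is available 
and that the fragment lives in conf/cassandra-env.sh. The `gc_threads` helper 
and the `GC_THREADS` variable are hypothetical names, and matching both thread 
counts to the core count is one tuning reported in this thread, not an 
official default.

```shell
# Hypothetical helper: print the number of online cores, falling back to 1
# when getconf is unavailable or returns something non-numeric.
gc_threads() {
    n=$(getconf _NPROCESSORS_ONLN 2>/dev/null)
    case "$n" in
        ''|*[!0-9]*) n=1 ;;
    esac
    echo "$n"
}

GC_THREADS=$(gc_threads)
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
# Assumption: one parallel and one concurrent GC thread per core.
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=$GC_THREADS"
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=$GC_THREADS"
JVM_OPTS="$JVM_OPTS -XX:G1RSetUpdatingPauseTimePercent=5"
```

Computing the value at startup avoids hard-coding a core count that is wrong 
on a differently sized machine.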

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-19 Thread Michael Perrone (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14550721#comment-14550721
 ] 

Michael Perrone commented on CASSANDRA-7486:


I have done extensive load testing with G1GC on Java 1.7.0_80 and Cassandra 
2.0.12.x with Solr secondary indexes and a 20GB max heap. On 8-core systems 
these options were the sweet spot for the test workload and worked out well in 
a production cluster, providing dramatic improvements in overall GC time and 
eliminating long CMS pauses that we could not tune out. I will try to attach 
some graphs/tables/metrics in another comment. 

{code:yaml}
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
# set these to the number of cores
JVM_OPTS="$JVM_OPTS -XX:ConcGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=8"
JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
# default is 10, this makes G1 slightly more aggressive
# by starting the marking cycle earlier
# in order to avoid evacuation failure (OOM)
JVM_OPTS="$JVM_OPTS -XX:G1ReservePercent=15"
# default is 45, we should start sooner.
# in a high write large heap (20GB) this was
# found to eliminate Old gen pauses
JVM_OPTS="$JVM_OPTS -XX:InitiatingHeapOccupancyPercent=25"
# default is 200 there is a tradeoff between latency
# and throughput. increase this to increase throughput
# at the cost of potential latency, up to 1000
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"
# use largest possible region size (32MB)
# speeds up marking phase, tradeoff is efficiency
# comment out to let JVM decide the size
JVM_OPTS="$JVM_OPTS -XX:G1HeapRegionSize=32m"
{code}


 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-19 Thread Michael Perrone (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551202#comment-14551202
 ] 

Michael Perrone commented on CASSANDRA-7486:


Have not seen evacuation failures, but test systems ran tight enough under 
heavy load to increase the reserve percentage. Heap was 20GB. 



[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-01 Thread Matt Stump (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14524278#comment-14524278
 ] 

Matt Stump commented on CASSANDRA-7486:
---

Before we talk about changing the defaults I would like to see tests run on 
something more representative of customer hardware. At the very least we should 
be doing comparisons of CMS vs G1 for different workloads on cstar. I did some 
initial testing and didn't see a huge benefit, but I very well could have been 
doing something wrong. I'm both hopeful and skeptical. 

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-01 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523593#comment-14523593
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

bq. if I am reading correctly there was pretty much never an old generation 
collection under the workload I looked at. The old gen was growing but never 
reached the point it needed to do an old gen GC.

^ G1 doesn't work that way.

bq. Another behavior to consider is worst case pause time when there is 
fragmentation.

^ G1 performs compaction. It's fairly easy to trigger and observe in gc.log 
with Cassandra 2.0. It takes more work with 2.1 since it seems to be easier on 
the GC.

I'll see if I can find some time to generate graphs to make all this more 
convincing, but time is short because I'm spending all of my time tuning users' 
clusters where the #1 first issue every time is getting CMS to behave.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-01 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523630#comment-14523630
 ] 

Ariel Weisberg commented on CASSANDRA-7486:
---

bq. ^ G1 doesn't work that way.
I am talking about CMS. When I looked at the 12 gigabyte heap the old gen grew 
to 4.1 gigabytes and I didn't see any point that it shrunk.

bq.  I'm spending all of my time tuning users' clusters where the #1 first 
issue every time is getting CMS to behave.
We can make the case for G1 in different ways. If we want to do it based on 
real world results that is fine with me.

To Benedict's point I think looking at all the operations we care about on 
realistic time scales is something we would have to do to really know what the 
differences are. I wish we had this stuff in CI so it would just be a matter of 
changing the flags, but we aren't there yet.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-01 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523724#comment-14523724
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

I only started messing with G1 this year, so I only know the old behavior by 
lore I've read and heard. I have not observed significant problems with it in 
the ~20-40 hours I've spent tuning clusters with G1 recently.

I don't recommend anyone try G1 on JDK 7 < u75 or JDK 8 < u40 (although it's 
probably OK down to u20 according to the docs I've read). I did some testing on 
JDK7u75 and it was stable but didn't spend much time on it since JDK8u40 gave a 
nice bump in performance (5-10% on a customer cluster) by just switching JDKs 
and nothing else.

From what I've read about the reference clearing issues, there is a new-ish 
setting to enable parallel reference collection, -XX:+ParallelRefProcEnabled. 
The advice in the docs is to only turn it on if a significant amount of time 
is spent on RefProc collection, e.g. [Ref Proc: 5.2 ms]. I pulled that 
from a log I had handy and that is high enough that we might want to consider 
enabling the flag, but in most of my observations it hovers around 0.1ms under 
saturation load.
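
One way to judge whether reference processing takes a significant share of 
pause time is to summarize the Ref Proc entries in gc.log. This is a rough 
sketch, assuming the log contains entries shaped like the `[Ref Proc: 5.2 ms]` 
example above; the `ref_proc_summary` function name and the log path in the 
usage comment are hypothetical.

```shell
# Hypothetical helper: report count, mean and max of "Ref Proc: N ms" entries
# found in the GC log passed as the first argument.
ref_proc_summary() {
    awk '
        match($0, /Ref Proc: [0-9.]+ ms/) {
            # Drop the 10-character "Ref Proc: " prefix and the 3-character
            # " ms" suffix, leaving only the number.
            v = substr($0, RSTART + 10, RLENGTH - 13) + 0
            sum += v; n++
            if (v > max) max = v
        }
        END {
            if (n == 0) { print "no Ref Proc entries found"; exit }
            printf "count=%d mean=%.2fms max=%.2fms\n", n, sum / n, max
        }
    ' "$1"
}

# Usage (path is an assumption): ref_proc_summary /var/log/cassandra/gc.log
```

If the mean stays near the 0.1ms figure mentioned above, enabling 
-XX:+ParallelRefProcEnabled is unlikely to change much.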


 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523736#comment-14523736
 ] 

Benedict commented on CASSANDRA-7486:
-

bq. there is a new-ish setting to enable parallel reference collection

Throughput was something like 5% of other collectors in my testing, so 
parallelizing this would only help so much! :)

My point is, we don't really fully understand G1, and unless we undertake a 
research project to fully understand its pathological cases, and how they 
compare/contrast to its history, I'd prefer we ensured it behaved under complex 
loads, and not just under isolated read or write loading.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523621#comment-14523621
 ] 

Benedict commented on CASSANDRA-7486:
-

bq. ^ G1 doesn't work that way.

While it has no old generation, it does promote regions and if this happens a 
lot you can get some weird pathological fragmentation. Now, my experience with 
G1 is out of date, and I haven't kept up at all with its latest behaviours, but 
I saw some really atrocious behaviour on very simple benchmarks a few years 
back. At the time, if you modified references that were randomly distributed 
around the heap, it required traversing a majority of the heap to collect very 
little, and essentially thrashed. I realise it has improved, but I do not know 
in what ways, and so I'm wary of making it the default without being certain it 
no longer has pathological cases that are a problem for us. Unless we stress 
the collector so that it exercises its suboptimal characteristics, I am not 
really super confident. I hope this is simply overly cautious, but we know of 
users who also had serious problems with sudden degradation despite looking 
good in initial testing, and it would be great for that not to be a widespread 
problem.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-05-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14523146#comment-14523146
 ] 

Benedict commented on CASSANDRA-7486:
-

bq. if I am reading correctly there was pretty much never an old generation 
collection under the workload I looked at. The old gen was growing but never 
reached the point it needed to do an old gen GC. 
bq. Another behavior to consider is worst case pause time when there is 
fragmentation.

These are concerns we should not dismiss out of hand. My concern is that these 
benchmarks, run in an idealised world with a steady rate of work production, 
are not representative of a workload including repair, validation, long running 
huge compactions, hinting, periodic read/write load spikes. If these performance 
profiles are dependent on the memtables never being promoted, this is dependent 
on the disk keeping up, and under a worse but realistic workload the 
characteristics may be nothing like what [~ato...@datastax.com] is seeing. 
Changing these defaults should be done with the absolute utmost of care, and I 
would like to see a lot of very long running mixed workload tests including all 
of the other spanners in the works.

 

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-30 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14521805#comment-14521805
 ] 

Ariel Weisberg commented on CASSANDRA-7486:
---

I would just like to see the data visualized. If it's not better in every 
dimension, then in which dimensions is it better or worse across all the data 
that Al collected?

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-30 Thread Phil Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520926#comment-14520926
 ] 

Phil Yang commented on CASSANDRA-7486:
--

Another small question :)
Since many performance improvements were made to G1 in JDK 8 and its update 
releases, do we need to propose that JDK 7 users, especially those on early 
versions, update to the latest JDK 8?

I also found a JEP, "Make G1 the default garbage collector on 32- and 64-bit 
server configurations", for JDK 9. See http://openjdk.java.net/jeps/248 if you 
have not heard about it.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-29 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520071#comment-14520071
 ] 

Jonathan Ellis commented on CASSANDRA-7486:
---

bq. IMO consistent performance should always take precedence over maximum 
performance/throughput.

Agreed.  I think our bar here should be "Is G1 better for the average user?", 
keeping in mind that the average user is a *lot* worse at tuning CMS than 
Ariel.  Power users can tune for their own workload the way they always have.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-29 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519791#comment-14519791
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

[~aweisberg] comparing promotions between G1 and CMS doesn't really make sense 
IMO. G1 promotions simply mark memory where it is without copying. After a 
threshold it will compact surviving cards into a single region. What I've 
observed is that compaction is rarely necessary with a big enough G1 heap. With 
a saturation write workload on Cassandra 2.1 only ~100-200MB seems to stick 
around for the long haul with almost all the rest getting cycled every few 
minutes (in an 8GB heap).

[~yangzhe1991] I would keep the default heap at 8GB. I have tested with G1 at 
16GB on a 30GB m3.2xlarge on EC2 and it generally gets better throughput and 
latency because there's more space for G1 to "waste" (that's what they call 
it). Intel tested up to 100GB with HBase at a 200ms pause target and said nice 
things about it. I don't see much need for C* to hit that size but it's 
certainly doable with G1. The main problem is smaller heaps where G1 starts to 
struggle a little, but I found that it still works OK down to 512MB, even if a 
bit less efficient than CMS since G1 targets ~10% CPU time for GC while the 
others target 1% by default.

Throughput / latency is always a tradeoff and in the case of G1 with 
non-aggressive latency targets (-XX:MaxGCPauseMillis=2000) the throughput is 
darn close to CMS with considerably improved standard deviation on latency. IMO 
that's a great tradeoff as most of the users I talk to in the wild mostly 
struggle with getting reliable latency rather than throughput.

IMO consistent performance should always take precedence over maximum 
performance/throughput. G1 provides a much more consistent experience with 
fewer knobs to mess with (especially tuning eden size, which is still a black 
art that nearly every installation I've looked at gets wrong).

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.x




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-29 Thread Ariel Weisberg (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519456#comment-14519456
 ] 

Ariel Weisberg commented on CASSANDRA-7486:
---

For the C* we ship today we should evaluate whether G1 is better. For the 
platonic ideal C* (where the heap only needs to be 1 or 2 gig) I suspect we 
should ship CMS because I found it has lower baseline pause times for young gen 
collections especially on server class hardware.

This is something that we can do in a data driven way. [~ato...@datastax.com] 
got some good data, but when I sampled throughput (from the spreadsheet) on 
some workloads like the 12g CMS and G1 I saw more throughput under CMS. I think 
we should munge the data a bit and visualize throughput and P99 (or P99.9). I 
am also not a fan of basing the decision off of that # of cores and a non-NUMA 
machine which is not representative of the hardware people use.

I am not comfortable with the measurements for large heaps because if I am 
reading correctly there was pretty much never an old generation collection under 
the workload I looked at. The old gen was growing but never reached the point 
it needed to do an old gen GC. It's great the server can run that long with so 
little promotion (TIL that is a thing that happens). That explains the very 
long young gen pauses. Lots of survivor copying I guess when I look at the size 
of survivor set vs pause time. I saw young gen pauses in the 400+ millisecond 
range under both collectors.

Another behavior to consider is worst case pause time when there is 
fragmentation.

With all the overhead of survivor copying I start to wonder if a valid strategy 
would be to allow promotion and let the concurrent collector run all the time. 
That would bring down young-gen GC pauses in exchange for throughput.

I think whether 8099 means no off-heap memtables in 3.0 is also a factor. If G1 
scales to larger heaps and larger on heap memtables then it will be a better 
choice.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5




[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-29 Thread Phil Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14519604#comment-14519604
 ] 

Phil Yang commented on CASSANDRA-7486:
--

I think we should be prudent and careful about changing the default option. 
Usually for C* users, it is acceptable if a new version brings no performance 
improvement. However, it may be unacceptable if the new version performs 
worse. If there is a risk that in some cases G1 is worse than CMS, it may 
be a better choice to make G1 optional first by offering another 
conf/cassandra-env-g1.sh file to let people try it, without changing the 
default settings.

As for the tests comparing G1 and CMS, do they cover extreme cases? For 
example: bootstrap/rebuild/remove node, repair, lots of queries over 
tombstone_failure_threshold... And I think each test should run for at least 24 
hours so that it includes several full GCs when estimating the latency.

Furthermore, with CMS we currently have a max heap size limit (8GB) even if the 
node has a very large amount of memory. If we decide to change the default GC 
algorithm, what is a suitable new limit?
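
A sketch of how such an opt-in could look in a startup script, assuming a 
second env file named conf/cassandra-env-g1.sh exists as proposed above. The 
`CASSANDRA_GC` variable and the `select_env_file` function are hypothetical 
names used only for illustration; nothing like them ships with Cassandra.

```shell
# Hypothetical selector: print the env file to source for the chosen collector.
# CMS stays the default; G1 is used only when CASSANDRA_GC=g1 is set.
select_env_file() {
    conf_dir=$1
    case "${CASSANDRA_GC:-cms}" in
        g1) echo "$conf_dir/cassandra-env-g1.sh" ;;
        *)  echo "$conf_dir/cassandra-env.sh" ;;
    esac
}

# Usage (assumes CASSANDRA_CONF points at the conf directory):
#   . "$(select_env_file "$CASSANDRA_CONF")"
```

Keeping the existing file as the fallback means users who set nothing see no 
change in behaviour.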

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.6


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-28 Thread Rick Branson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518707#comment-14518707
 ] 

Rick Branson commented on CASSANDRA-7486:
-

I think it definitely makes sense as a default. My guess is that it'll result 
in fewer headaches for most people.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-28 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518704#comment-14518704
 ] 

Jonathan Ellis commented on CASSANDRA-7486:
---

Sounds like G1 is finally ready to replace CMS.  WDYT [~mstump] [~rbranson] 
[~aweisberg]?

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-15 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14497603#comment-14497603
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

My benchmarks completed. These were run on 6 quad-core Intel NUCs with 16GB RAM 
/ 240GB SSD / gigabit ethernet. Cassandra 2.1.4. The CPUs are fairly slow at 
1.4GHz. The tests were automated with a complete cluster rebuild between tests
and caches dropped before starting Cassandra each time.

The big win with G1 IMO is that it is auto-tuning. I've been running it on a 
few other kinds of machines and it generally does much better with more CPU 
power.

cassandra-stress was run with an increased heap but is otherwise unmodified
from Cassandra 2.1.4. I checked the GC log regularly and did not see many
pauses for stress itself above 1ms here and there, with most pauses in the
~300usec range.

The final output of the stress is available here:

https://docs.google.com/a/datastax.com/spreadsheets/d/19Eb7HGkd5rFUD_C0ZALbK6-R4fPF9vJRr8BrvxBwo38/edit?usp=sharing
http://tobert.org/downloads/cassandra-2.1-cms-vs-g1.csv

The stress commands, system.log, GC logs, conf directory from all the servers, 
and full stress logs are available on my webserver here:

http://tobert.org/downloads/cassandra-2.1-cms-vs-g1-data.tar.gz (35MB)


 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-08 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14485579#comment-14485579
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

This is with 2.0 / OpenJDK8 since that's what I had running. Same everything 
each run except for heap size. cassandra-stress 2.1.4 read workload / 800 
threads. I'll re-run with 2.1 / Oracle JDK8 and some mixed load.

-XX:+UseG1GC

Also: -XX:+UseTLAB -XX:+ResizeTLAB  -XX:-UseBiasedLocking -XX:+AlwaysPreTouch 
but maybe those should go in a different ticket.

8GB:

op rate   : 139805
partition rate: 139805
row rate  : 139805
latency mean  : 5.7
latency median: 4.2
latency 95th percentile   : 13.2
latency 99th percentile   : 18.5
latency 99.9th percentile : 21.1
latency max   : 303.8

512MB:

op rate   : 114214
partition rate: 114214
row rate  : 114214
latency mean  : 7.0
latency median: 3.7
latency 95th percentile   : 12.4
latency 99th percentile   : 14.7
latency 99.9th percentile : 15.3
latency max   : 307.1

256MB:

op rate   : 60028
partition rate: 60028
row rate  : 60028
latency mean  : 13.3
latency median: 4.0
latency 95th percentile   : 44.7
latency 99th percentile   : 73.5
latency 99.9th percentile : 79.6
latency max   : 1105.4

Same everything with mostly stock CMS settings for 2.0. I added the  
-XX:+UseTLAB -XX:+ResizeTLAB  -XX:-UseBiasedLocking -XX:+AlwaysPreTouch 
settings to keep the numbers comparable to all of my other data.

8GB/1GB:

op rate   : 119155
partition rate: 119155
row rate  : 119155
latency mean  : 6.7
latency median: 4.1
latency 95th percentile   : 11.8
latency 99th percentile   : 15.5
latency 99.9th percentile : 17.3
latency max   : 520.2


512MB ( -XX:+UseAdaptiveSizePolicy):

op rate   : 82375
partition rate: 82375
row rate  : 82375
latency mean  : 9.7
latency median: 4.3
latency 95th percentile   : 28.2
latency 99th percentile   : 49.4
latency 99.9th percentile : 54.8
latency max   : 2642.6


256MB ( -XX:+UseAdaptiveSizePolicy):

op rate   : 77705
partition rate: 77705
row rate  : 77705
latency mean  : 10.3
latency median: 4.8
latency 95th percentile   : 33.6
latency 99th percentile   : 45.3
latency 99.9th percentile : 49.1
latency max   : 1990.0
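
To make the two sets of summaries above easier to compare, the relative
differences can be computed directly. The following is a minimal Python
sketch and not part of the original benchmark tooling: the dictionary layout
and function names are illustrative assumptions, and the figures are copied
from the stress output above (the CMS entry keyed "8GB" is the run labelled
8GB/1GB).

```python
# Figures copied from the cassandra-stress summaries above.
# Keys are heap sizes; values are (op rate, latency max) as reported by stress.
G1 = {"8GB": (139805, 303.8), "512MB": (114214, 307.1), "256MB": (60028, 1105.4)}
CMS = {"8GB": (119155, 520.2), "512MB": (82375, 2642.6), "256MB": (77705, 1990.0)}


def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


def compare(g1, cms):
    """Return {heap: (op-rate change %, max-latency change %)} for G1 vs. CMS."""
    return {
        heap: (pct_change(g1[heap][0], cms[heap][0]),
               pct_change(g1[heap][1], cms[heap][1]))
        for heap in g1
    }


if __name__ == "__main__":
    for heap, (rate, lat_max) in compare(G1, CMS).items():
        print(f"{heap:>6}: op rate {rate:+.1f}%, latency max {lat_max:+.1f}%")
```

On these figures G1's op rate is roughly 17% higher at 8GB and 39% higher at
512MB, and roughly 23% lower at 256MB, while its max latency is lower at all
three heap sizes.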


 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-07 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484318#comment-14484318
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

So far my testing of read workloads matches my experience with writes. An 8GB 
heap with generic G1GC settings is good for more workloads out of the box 
than haphazardly tuned CMS can be. I've been testing on a mix of Oracle/OpenJDK 
and JDK7/8 and the results are fairly consistent across the board with the 
exception that performance is a tad higher (~5%) on JDK8 than JDK7 (with G1GC - 
I have not tested CMS much on JDK8).

These parameters get better throughput than CMS out of the box with 
significantly improved consistency in the max and p99.9 latency.

-Xmx8G -Xms8G -XX:+UseG1GC

If throughput is more critical than latency, the following will get a few % 
more throughput at the cost of potentially higher max pause times:

-Xmx8G -Xms8G -XX:+UseG1GC -XX:MaxGCPauseMillis=2000 
-XX:InitiatingHeapOccupancyPercent=75

My recommendation is to document the last two options in cassandra-env.sh but 
leave them disabled/commented out for end-users to fiddle with. Other knobs for 
G1 didn't make a statistically measurable difference in my observations.
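
As a rough illustration of the two option sets above, here is a small Python
sketch that assembles them. The function name and its parameters are
hypothetical; only the JVM flags themselves come from this comment.

```python
def g1_jvm_opts(heap_gb=8, favor_throughput=False):
    """Build the G1 option list described above.

    heap_gb and favor_throughput are illustrative parameters, not settings
    that exist in cassandra-env.sh.
    """
    heap = f"{heap_gb}G"
    # Baseline: fixed-size heap with G1 and no further tuning.
    opts = [f"-Xmx{heap}", f"-Xms{heap}", "-XX:+UseG1GC"]
    if favor_throughput:
        # Trades potentially higher max pause times for a few % more throughput.
        opts += ["-XX:MaxGCPauseMillis=2000",
                 "-XX:InitiatingHeapOccupancyPercent=75"]
    return opts


if __name__ == "__main__":
    print(" ".join(g1_jvm_opts()))
    print(" ".join(g1_jvm_opts(favor_throughput=True)))
```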

G1 scales particularly well with heap size on huge machines. 8 to 16GB doesn't 
seem to make a big difference, matching what [~rbranson] saw. At 24GB I started 
seeing about 8-10% throughput increase with little variance in pause times.

IMO the simple G1 configuration should be the default for large heaps. It's
simple and provides consistent latency. Because it uses heuristics to determine
the eden size and scanning schedule, it adapts well to diverse environments
without tweaking. Heap sizes under 8GB should continue to use CMS or even
experiment with serial collectors (e.g. Raspberry Pi, t2.micro, vagrant). If
there is interest, I will write up a patch for cassandra-env.sh to make the
auto-detection code pick G1GC at >= 6GB heap and CMS for < 6GB.
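
The proposed auto-detection rule can be sketched in a few lines of Python.
This is only an illustration of the logic, not the cassandra-env.sh patch
itself: it reads the threshold as "G1 at 6GB and above, CMS below", the
function and constant names are made up, and -XX:+UseConcMarkSweepGC is the
standard HotSpot flag for CMS rather than something quoted in this thread.

```python
G1_MIN_HEAP_MB = 6 * 1024  # assumed reading of the proposed 6GB threshold


def pick_collector(max_heap_mb, threshold_mb=G1_MIN_HEAP_MB):
    """Return the collector flag: G1 at or above the threshold, CMS below it."""
    if max_heap_mb >= threshold_mb:
        return "-XX:+UseG1GC"
    return "-XX:+UseConcMarkSweepGC"


if __name__ == "__main__":
    for heap_mb in (2048, 6144, 8192):
        print(heap_mb, pick_collector(heap_mb))
```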

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-07 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484645#comment-14484645
 ] 

Jonathan Ellis commented on CASSANDRA-7486:
---

[~ato...@datastax.com] wdyt of mstump's suggestions at CASSANDRA-8150?

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-07 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484647#comment-14484647
 ] 

Jonathan Ellis commented on CASSANDRA-7486:
---

Also, how much better is CMS for small heaps?  Given that sub-8GB heaps aren't 
particularly common or recommended, can we just simplify it to use G1?

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-07 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14484657#comment-14484657
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

I'll kick off some tests and find out. All of the Oracle docs say not to bother 
below 6GB, but yeah I agree, if it's basically not bad we should go with simple.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.5


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-04-01 Thread Robert Stupp (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14390359#comment-14390359
 ] 

Robert Stupp commented on CASSANDRA-7486:
-

Nice :)

Looking forward to seeing the Oracle JVM and C* 2.1 results. (TBH, I don't expect much
difference between OpenJDK8 and Oracle JDK8)

But more interesting would be how G1 behaves w/ read and mixed workloads.


 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.4


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-03-31 Thread Albert P Tobey (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14389850#comment-14389850
 ] 

Albert P Tobey commented on CASSANDRA-7486:
---

I managed to get G1 (Java 8) to beat CMS on both latency and throughput on my 
NUC cluster.

Preliminary results: https://gist.github.com/tobert/ea9328e4873441c7fc34



 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.4


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2015-03-09 Thread Jeremy Hanna (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353170#comment-14353170
 ] 

Jeremy Hanna commented on CASSANDRA-7486:
-

Any update on this testing [~shawn.kumar]?  Just wondered as this ticket seemed 
promising initially but hasn't been updated in some time.

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Shawn Kumar
 Fix For: 2.1.4


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-7486gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2014-07-02 Thread Rick Branson (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14050677#comment-14050677
 ] 

Rick Branson commented on CASSANDRA-7486:
-

Mad anecdotes:

We ran with G1 enabled for around 4 days in a 33-node cluster running 1.2.17 on 
JDK7u45 that has around a 1:5 read:write ratio. We tried a few different 
configurations with short durations, but most of the time we ran it with the 
out-of-the-box G1 configuration on a 20G heap and 32 parallel GC threads (16 
core, 32 hyperthreaded). There were some somewhat scary bugs fixed in 7u60 that 
ultimately caused me to roll back to the CMS collector after the experiment.

* The experiment pointed out that our young gen was basically too small and was
pulling latency up significantly. When we returned to CMS, I doubled the new
size from 800M -> 1600M. We had moved to new hardware and hadn't taken the time
to sit down and play with GC settings. This cut our mean latency dramatically
as perceived from the client, ~50% for writes and ~30% for reads, similar to
what we saw with G1. I was quite thrilled with this result.
* I tried both 100ms and 150ms pause time targets with 12G, 16G, and 20G
heaps, and while these resulted in slightly lower mean latency (~5-10%), Mixed 
GC activity caused P99s to suffer greatly. There's compelling evidence that the 
200ms default is nearly ideal for the way the G1 algorithm works in its current 
incarnation.
* We basically needed a 20G heap to make G1 work well for us, since by default 
G1 will use up to half of the max heap for eden space and Cassandra needs quite 
a large old gen to stay happy. G1 appears to need a much larger eden space to 
work efficiently, sizes that would make ParNew die in a fire. GCs of the eden 
space were impressively fast, with a ~10G eden space taking ~120ms on average 
to collect.
* G1's huge eden space was helpful working around some issues with compaction 
on hints CF which had dozens of very wide partitions, hundreds of thousands of 
cells each.
* Overall, at the default 200ms pause time target, we didn't see much of an 
increase in CPU usage over CMS.
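
For anyone wanting to repeat the pause-target comparison in the list above,
the combinations can be enumerated mechanically. The Python sketch below is
hypothetical: it assumes a fixed-size heap (-Xms equal to -Xmx) and 32
parallel GC threads for every run, neither of which is stated for each
configuration above. The flags used are standard HotSpot options.

```python
from itertools import product

PAUSE_TARGETS_MS = (100, 150)  # pause time targets mentioned above
HEAP_SIZES_GB = (12, 16, 20)   # heap sizes mentioned above
GC_THREADS = 32                # assumed to apply to every run


def experiment_matrix():
    """Yield one JVM option list per (heap size, pause target) combination."""
    for heap_gb, pause_ms in product(HEAP_SIZES_GB, PAUSE_TARGETS_MS):
        yield [
            f"-Xms{heap_gb}G",
            f"-Xmx{heap_gb}G",
            "-XX:+UseG1GC",
            f"-XX:MaxGCPauseMillis={pause_ms}",
            f"-XX:ParallelGCThreads={GC_THREADS}",
        ]


if __name__ == "__main__":
    for opts in experiment_matrix():
        print(" ".join(opts))
```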

In the end, my tests basically told us that G1 requires a larger heap to get 
the same results with *far* less tuning. If there are GC issues, it seems like 
in the vast majority of cases G1 can either eliminate them or G1 makes it easy 
to just work around them by cranking up the heap size. Someone should probably
test G1 with a variable-sized heap since it's designed to give back RAM when it 
thinks it doesn't need it. That might or might not actually work. While we 
didn't test this, a configuration of G1 + heap size min of 1/8 RAM and max of 
1/2 RAM might make a really nice default for Cassandra at some point.
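
The untested default floated above (heap min of 1/8 RAM, max of 1/2 RAM)
works out as follows. This is a minimal Python sketch of the arithmetic only;
the function names are invented, and how system RAM is detected is left out.

```python
def heap_bounds_mb(system_ram_mb):
    """Return (min_heap_mb, max_heap_mb) as 1/8 and 1/2 of system RAM."""
    return system_ram_mb // 8, system_ram_mb // 2


def heap_flags(system_ram_mb):
    """Render the bounds as JVM options alongside the G1 flag."""
    min_mb, max_mb = heap_bounds_mb(system_ram_mb)
    return [f"-Xms{min_mb}M", f"-Xmx{max_mb}M", "-XX:+UseG1GC"]


if __name__ == "__main__":
    # Example: a host with 64GB of RAM.
    print(" ".join(heap_flags(64 * 1024)))
```

For a 64GB host this gives an 8GB minimum and a 32GB maximum heap.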

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Ryan McGuire
 Fix For: 2.1.0


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2014-07-01 Thread Jonathan Ellis (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049127#comment-14049127
 ] 

Jonathan Ellis commented on CASSANDRA-7486:
---

/cc [~rbranson] [~benedict]

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Ryan McGuire
 Fix For: 2.1.0


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.
 (NB this is off by default but needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2014-07-01 Thread T Jake Luciani (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049219#comment-14049219
 ] 

T Jake Luciani commented on CASSANDRA-7486:
---

See also 
https://software.intel.com/en-us/blogs/2014/06/18/part-1-tuning-java-garbage-collection-for-hbase

100gb heaps!

 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Ryan McGuire
 Fix For: 2.1.0


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)





[jira] [Commented] (CASSANDRA-7486) Compare CMS and G1 pause times

2014-07-01 Thread Benedict (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14049217#comment-14049217
 ] 

Benedict commented on CASSANDRA-7486:
-

We need to make sure we test over an extended period with a variety of 
operations being exercised against the cluster. This is probably a good 
opportunity to try and define a real world burn in test as well, and what 
parameters should be included.

Some things to consider:

# Range of data distributions, including (esp. for this) large partitions and 
very large cells. Possibly run two or three parallel stress profiles with very 
different data profiles to really give GC a headache dealing with different 
velocities / lifetimes.
# Incremental and full repairs
# Hint accumulation / node death
# Tombstones / Range Tombstones
# Secondary indexes?

I'd suggest ignoring some variables, and e.g. stick with just netty, so we can 
define a single complex workload and run it for an extended period and get a 
good result. While our client buffers behave quite differently with each, I'm 
happy tuning defaults for native now it's faster.

It might also be useful, for this test only, to see for a single node how well 
the two degrade as heap pressure increases, by artificially consuming large 
portions of the heap for the duration of a simpler stress test.


 Compare CMS and G1 pause times
 --

 Key: CASSANDRA-7486
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7486
 Project: Cassandra
  Issue Type: Test
  Components: Config
Reporter: Jonathan Ellis
Assignee: Ryan McGuire
 Fix For: 2.1.0


 See 
 http://www.slideshare.net/MonicaBeckwith/garbage-first-garbage-collector-g1-gc-migration-to-expectations-and-advanced-tuning
  and https://twitter.com/rbranson/status/482113561431265281
 May want to default 2.1 to G1.
 2.1 is a different animal from 2.0 after moving most of memtables off heap.  
 Suspect this will help G1 even more than CMS.  (NB this is off by default but 
 needs to be part of the test.)


