Re: Spark on Java 17
It is just a goal; I would not tune the number of regions or the region size yet. Simply specify the GC algorithm and the max heap size. Tune other options only if there is a need, only one at a time (otherwise it is difficult to determine cause and effect), and have a performance testing framework in place to be able to measure the differences.

Do you need those large heaps in Spark? Why not split the work further to have more tasks with less memory? I understand that each job is different and there can be reasons for it, but I often try to just use the defaults and then tune individual options. I also try to avoid certain extreme values (of course there are cases when they are needed). Especially when upgrading from one Spark version to another, I often find it is better to start with a Spark job with default settings, because Spark itself has improved/changed how it works.

To reduce the needed heap you can try to increase the number of tasks (see https://spark.apache.org/docs/latest/configuration.html): increase spark.executor.cores (to a few) and spark.sql.shuffle.partitions (the default is 200; you can try how much it brings to change it to 400, etc.), and reduce spark.executor.memory.

On 10.12.2023 at 02:33, Faiz Halde wrote:
> Thanks, I'll check them out.
>
> Curious though: the official G1GC page https://www.oracle.com/technical-resources/articles/java/g1gc.html says that there must be no more than 2048 regions and that the region size is limited to 32 MB. That's strange, because our heaps go up to 100 GB, and that would require a 64 MB region size to be under 2048.
>
> Thanks
> Faiz
>
> On Sat, Dec 9, 2023, 10:33 Luca Canali wrote:
>> Hi Faiz,
>> [...]
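The tuning direction suggested above (more, smaller tasks instead of one huge heap) could look like this on the command line. This is only a sketch: the numeric values, the class name, and the jar path are illustrative assumptions, not recommendations.

```shell
# Sketch of the "more tasks, less memory per executor" direction described above.
# All values, the class name, and the jar are placeholders for illustration.
spark-submit \
  --conf spark.executor.cores=4 \
  --conf spark.sql.shuffle.partitions=400 \
  --conf spark.executor.memory=16g \
  --class com.example.MyJob \
  my-job.jar
```

As the message says, change one option at a time and measure each change against a baseline run.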
Re: Spark on Java 17
Thanks, I'll check them out.

Curious though: the official G1GC page https://www.oracle.com/technical-resources/articles/java/g1gc.html says that there must be no more than 2048 regions and that the region size is limited to 32 MB. That's strange, because our heaps go up to 100 GB, and that would require a 64 MB region size to be under 2048.

Thanks
Faiz

On Sat, Dec 9, 2023, 10:33 Luca Canali wrote:

> Hi Faiz,
>
> We find G1GC works well for some of our workloads that are Parquet-read
> intensive, and we have been using G1GC with Spark on Java 8 already
> (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions =
> "-XX:+UseG1GC"), while currently we are mostly running Spark (3.3 and
> higher) on Java 11.
>
> However, the best is always to refer to measurements of your specific
> workloads; let me know if you find something different.
> BTW, besides the WebUI, I typically measure GC time also with a couple of
> custom tools: https://github.com/cerndb/spark-dashboard and
> https://github.com/LucaCanali/sparkMeasure
>
> A few tests of microbenchmarking Spark reading Parquet with a few
> different JDKs at: https://db-blog.web.cern.ch/node/192
>
> Best,
>
> Luca
>
> From: Faiz Halde
> Sent: Thursday, December 7, 2023 23:25
> To: user@spark.apache.org
> Subject: Spark on Java 17
>
> [...]
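The region arithmetic behind the question above can be checked directly. 2048 regions is G1's sizing goal rather than a hard limit, and the default region size tops out at 32 MiB, so a 100 GiB heap simply ends up with more than 2048 regions (a sketch using the numbers from the thread):

```shell
# G1 region-size arithmetic for the 100 GiB heap mentioned above.
# 2048 regions is G1's *goal*, not a hard cap; the default region size
# is capped at 32 MiB, so large heaps exceed 2048 regions.
heap_mib=$((100 * 1024))              # 100 GiB expressed in MiB
ideal_region_mib=$((heap_mib / 2048)) # region size needed for exactly 2048 regions
regions_at_cap=$((heap_mib / 32))     # region count at the 32 MiB cap
echo "ideal region: ${ideal_region_mib} MiB, regions at 32 MiB cap: ${regions_at_cap}"
```

So a 100 GiB heap would "want" 50 MiB regions (64 MiB as the next power of two, hence the 64 MB figure in the question), but with the default 32 MiB cap it just runs with roughly 3200 regions, which G1 handles fine.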
Re: Spark on Java 17
If you do tests with newer Java versions you can also try:

- NUMA awareness: -XX:+UseNUMA. See https://openjdk.org/jeps/345

You can also assess the newer Java GC algorithms:

- -XX:+UseShenandoahGC - works with terabytes of heap and is more memory-efficient than ZGC with heaps < 32 GB. See also https://developers.redhat.com/articles/2021/09/16/shenandoah-openjdk-17-sub-millisecond-gc-pauses
- -XX:+UseZGC - also works with terabytes of heap. See also https://www.baeldung.com/jvm-zgc-garbage-collector

Note: in JDK 21, ZGC has an additional option that could make sense to activate: -XX:+ZGenerational

See also https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector

Note: it might also be worth trying JDK 21 - it has optimizations for certain GCs (among other things; I wonder how much improvement virtual threads can bring to Spark).

> On 08.12.2023 at 01:02, Faiz Halde wrote:
>
> Hello,
>
> We are planning to switch to Java 17 for Spark and were wondering if there are
> any obvious learnings from anybody related to JVM tuning?
>
> We've been running on Java 8 for a while now and used to use the parallel GC,
> as that used to be a general recommendation for high-throughput systems. How
> has the default G1GC worked out with Spark?
>
> Thanks
> Faiz
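A sketch of how such GC flags could be passed to Spark: ZGC is used here purely as an example, and the class name and jar path are placeholders. This assumes a JDK build that includes the chosen collector.

```shell
# Sketch: selecting an alternative collector for Spark driver and executors.
# ZGC is only an example; substitute -XX:+UseShenandoahGC to try Shenandoah,
# or append -XX:+ZGenerational on JDK 21 for generational ZGC.
# Class name and jar are placeholders.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-XX:+UseZGC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseZGC" \
  --class com.example.MyJob \
  my-job.jar
```

As elsewhere in the thread: change one flag at a time and compare against a measured baseline.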
RE: Spark on Java 17
Hi Faiz,

We find G1GC works well for some of our workloads that are Parquet-read intensive, and we have been using G1GC with Spark on Java 8 already (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions = "-XX:+UseG1GC"), while currently we are mostly running Spark (3.3 and higher) on Java 11.

However, the best is always to refer to measurements of your specific workloads; let me know if you find something different. BTW, besides the WebUI, I typically measure GC time also with a couple of custom tools: https://github.com/cerndb/spark-dashboard and https://github.com/LucaCanali/sparkMeasure

A few tests of microbenchmarking Spark reading Parquet with a few different JDKs at: https://db-blog.web.cern.ch/node/192

Best,
Luca

From: Faiz Halde
Sent: Thursday, December 7, 2023 23:25
To: user@spark.apache.org
Subject: Spark on Java 17

Hello,

We are planning to switch to Java 17 for Spark and were wondering if there are any obvious learnings from anybody related to JVM tuning?

We've been running on Java 8 for a while now and used to use the parallel GC, as that used to be a general recommendation for high-throughput systems. How has the default G1GC worked out with Spark?

Thanks
Faiz
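The two settings Luca describes above, written out as spark-defaults.conf entries (a minimal sketch; only the two property names and the flag come from the message):

```
spark.driver.extraJavaOptions    -XX:+UseG1GC
spark.executor.extraJavaOptions  -XX:+UseG1GC
```

Equivalently, they can be passed per job via `--conf` on spark-submit.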