Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
It is just a goal… however, I would not tune the number of regions or the
region size yet. Simply specify the GC algorithm and the max heap size. Try
to tune other options only if there is a need, only one at a time (otherwise
it is difficult to determine cause and effect), and have a performance
testing framework in place to be able to measure differences.

Do you need those large heaps in Spark? Why not split the tasks further to
have more tasks with less memory? I understand that each job is different
and there can be reasons for it, but I often try to just use the defaults
and then tune individual options. I also try to avoid certain extreme values
(of course there are cases when they are needed). Especially when upgrading
from one Spark version to another, I often find it is better to start with a
Spark job with default settings, because Spark itself has improved/changed
how it works.

To reduce the needed heap you can try to increase the number of tasks (see
https://spark.apache.org/docs/latest/configuration.html): set
spark.executor.cores to a few cores, increase spark.sql.shuffle.partitions
(the default is 200 - you can try how much it brings to change it to 400
etc.), and reduce spark.executor.memory.

On 10.12.2023 at 02:33, Faiz Halde wrote:

> Thanks, I'll check them out.
>
> Curious though, the official G1GC page
> https://www.oracle.com/technical-resources/articles/java/g1gc.html says
> that there must be no more than 2048 regions and the region size is
> limited to 32 MB.
>
> That's strange because our heaps go up to 100 GB and that would require a
> 64 MB region size to be under 2048.
>
> Thanks
> Faiz
>
> On Sat, Dec 9, 2023, 10:33 Luca Canali wrote:
>
> Hi Faiz,
>
> We find G1GC works well for some of our workloads that are Parquet-read
> intensive and we have been using G1GC with Spark on Java 8 already
> (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions =
> “-XX:+UseG1GC”), while currently we are mostly running Spark (3.3 and
> higher) on Java 11.
>
> However, the best is always to refer to measurements of your specific
> workloads; let me know if you find something different.
>
> BTW besides the WebUI, I typically measure GC time also with a couple of
> custom tools: https://github.com/cerndb/spark-dashboard and
> https://github.com/LucaCanali/sparkMeasure
>
> A few tests of microbenchmarking Spark reading Parquet with a few
> different JDKs at: https://db-blog.web.cern.ch/node/192
>
> Best,
> Luca
>
> From: Faiz Halde
> Sent: Thursday, December 7, 2023 23:25
> To: user@spark.apache.org
> Subject: Spark on Java 17
>
> Hello,
>
> We are planning to switch to Java 17 for Spark and were wondering if
> there are any obvious learnings from anybody related to JVM tuning?
>
> We've been running on Java 8 for a while now and used to use the parallel
> GC as that used to be a general recommendation for high throughput
> systems. How has the default G1GC worked out with Spark?
>
> Thanks
> Faiz
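
For illustration, a minimal Scala sketch of such a baseline - the concrete
values (16g executor memory, 4 cores, 400 shuffle partitions) are
hypothetical placeholders to be tuned one at a time, not recommendations:

  import org.apache.spark.sql.SparkSession

  // Baseline sketch: GC algorithm plus heap size via Spark's own settings.
  val spark = SparkSession.builder()
    .appName("g1gc-baseline")
    // Spark derives the executor max heap from spark.executor.memory;
    // -Xmx itself is not allowed in extraJavaOptions.
    .config("spark.executor.memory", "16g")
    .config("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    // Fewer cores per executor plus more shuffle partitions gives more,
    // smaller tasks for the same heap.
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
  // Driver-side JVM options (spark.driver.extraJavaOptions) must be set
  // before the driver JVM starts, e.g. in spark-defaults.conf or on
  // spark-submit, rather than programmatically here.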

Re: Spark on Java 17

2023-12-09 Thread Faiz Halde
Thanks, I'll check them out.

Curious though, the official G1GC page
https://www.oracle.com/technical-resources/articles/java/g1gc.html says
that there must be no more than 2048 regions and that the region size is
limited to 32 MB.

That's strange because our heaps go up to 100 GB and that would require a
64 MB region size to be under 2048.

Thanks
Faiz
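
As a side note, a short Scala sketch of the arithmetic behind that
observation, taking the roughly-2048-region target and the 32 MB region
limit from the Oracle article at face value:

  // Region-size arithmetic for the 100 GB heap mentioned above.
  val heapBytes      = 100L * 1024 * 1024 * 1024
  val targetRegions  = 2048L
  val idealRegionMiB = heapBytes / targetRegions / (1024 * 1024)  // = 50
  // The next power of two above 50 MiB is 64 MiB, which is over the 32 MB
  // limit described in the article; with 32 MiB regions a 100 GB heap has
  // about 3200 regions, i.e. 2048 is a target rather than a hard cap.
  val regionsAt32MiB = heapBytes / (32L * 1024 * 1024)            // = 3200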

On Sat, Dec 9, 2023, 10:33 Luca Canali  wrote:

> Hi Faiz,
>
>
>
> We find G1GC works well for some of our workloads that are Parquet-read
> intensive and we have been using G1GC with Spark on Java 8 already
> (spark.driver.extraJavaOptions and spark.executor.extraJavaOptions=
> “-XX:+UseG1GC”), while currently we are mostly running Spark (3.3 and
> higher) on Java 11.
>
> However, the best is always to refer to measurements of your specific
> workloads; let me know if you find something different.
> BTW besides the WebUI, I typically measure GC time also with a couple of
> custom tools: https://github.com/cerndb/spark-dashboard and
> https://github.com/LucaCanali/sparkMeasure
>
> A few tests of microbenchmarking Spark reading Parquet with a few
> different JDKs at: https://db-blog.web.cern.ch/node/192
>
>
>
> Best,
>
> Luca
>
>
>
>
>
> *From:* Faiz Halde 
> *Sent:* Thursday, December 7, 2023 23:25
> *To:* user@spark.apache.org
> *Subject:* Spark on Java 17
>
>
>
> Hello,
>
>
>
> We are planning to switch to Java 17 for Spark and were wondering if
> there are any obvious learnings from anybody related to JVM tuning?
>
>
>
> We've been running on Java 8 for a while now and used to use the parallel
> GC as that used to be a general recommendation for high throughput systems.
> How has the default G1GC worked out with Spark?
>
>
>
> Thanks
>
> Faiz
>


Re: Spark on Java 17

2023-12-09 Thread Jörn Franke
If you do tests with newer Java versions you can also try:

- UseNUMA: -XX:+UseNUMA. See https://openjdk.org/jeps/345

You can also assess the newer Java GC algorithms:
- -XX:+UseShenandoahGC - works with terabytes of heap - more memory efficient 
than ZGC with heaps < 32 GB. See also: 
https://developers.redhat.com/articles/2021/09/16/shenandoah-openjdk-17-sub-millisecond-gc-pauses
- -XX:+UseZGC - also works with terabytes of heap - see also 
https://www.baeldung.com/jvm-zgc-garbage-collector

Note: in JDK 21, ZGC has an additional option that could make sense to activate:

-XX:+ZGenerational

See also 
https://developers.redhat.com/articles/2021/11/02/how-choose-best-java-garbage-collector

Note: it might also be worth trying JDK 21 - it has optimizations for certain 
GCs, amongst other things (I wonder how much improvement virtual threads can 
bring to Spark).
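
For illustration, here is one way such flags could be passed to the
executors - a minimal Scala sketch, not from the original thread. The choice
of generational ZGC (JDK 21) is just an example; enable exactly one
collector per run, and note that Shenandoah is only present in JDK builds
that ship it:

  import org.apache.spark.sql.SparkSession

  // Example only: generational ZGC on JDK 21 executors. Swap the flag string
  // for "-XX:+UseShenandoahGC" or "-XX:+UseG1GC -XX:+UseNUMA" to compare
  // collectors, and measure each change against a baseline run.
  val spark = SparkSession.builder()
    .appName("gc-experiment")
    .config("spark.executor.extraJavaOptions", "-XX:+UseZGC -XX:+ZGenerational")
    .getOrCreate()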

> On 08.12.2023 at 01:02, Faiz Halde wrote:
> 
> 
> Hello,
> 
> We are planning to switch to Java 17 for Spark and were wondering if there are 
> any obvious learnings from anybody related to JVM tuning?
> 
> We've been running on Java 8 for a while now and used to use the parallel GC 
> as that used to be a general recommendation for high throughput systems. How 
> has the default G1GC worked out with Spark?
> 
> Thanks
> Faiz


RE: Spark on Java 17

2023-12-09 Thread Luca Canali
Hi Faiz,

We find G1GC works well for some of our workloads that are Parquet-read 
intensive and we have been using G1GC with Spark on Java 8 already 
(spark.driver.extraJavaOptions and spark.executor.extraJavaOptions= 
“-XX:+UseG1GC”), while currently we are mostly running Spark (3.3 and higher) 
on Java 11.
However, the best is always to refer to measurements of your specific 
workloads; let me know if you find something different.
BTW besides the WebUI, I typically measure GC time also with a couple of custom 
tools: https://github.com/cerndb/spark-dashboard and  
https://github.com/LucaCanali/sparkMeasure
A few tests of microbenchmarking Spark reading Parquet with a few different 
JDKs at: https://db-blog.web.cern.ch/node/192

Best,
Luca
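
As an illustration of that kind of measurement, a small Scala sketch based
on the sparkMeasure README (the package coordinates and version below are an
assumption to check against the project page; the query is just a toy
workload):

  // Launch e.g. spark-shell --packages ch.cern.sparkmeasure:spark-measure_2.12:0.23
  import ch.cern.sparkmeasure.StageMetrics

  val stageMetrics = StageMetrics(spark)
  // Runs the action and prints aggregated stage metrics, which include
  // jvmGCTime alongside executor run time, so GC overhead can be compared
  // across JDK and GC configurations.
  stageMetrics.runAndMeasure {
    spark.sql("select count(*) from range(1000) cross join range(1000)").show()
  }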


From: Faiz Halde 
Sent: Thursday, December 7, 2023 23:25
To: user@spark.apache.org
Subject: Spark on Java 17

Hello,

We are planning to switch to Java 17 for Spark and were wondering if there are 
any obvious learnings from anybody related to JVM tuning?

We've been running on Java 8 for a while now and used to use the parallel GC as 
that used to be a general recommendation for high throughput systems. How has 
the default G1GC worked out with Spark?

Thanks
Faiz