Hi,

Cluster: 5 nodes (1 driver and 4 workers)
Driver config: 16 cores, 32 GB RAM
Worker config: 8 cores, 16 GB RAM
I'm using the parameters below. The first chunk is cluster-dependent and the second chunk is data/code-dependent.

--num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G

--conf spark.sql.shuffle.partitions=100 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=20000

I arrived at these values through my own R&D on the properties and the issues I ran into along the way, hence the workarounds. My questions are:

*1) How can I infer, using some formula or code, the data/code-dependent chunk above?*

*2) What other properties/configurations can I use to shorten my job runtime?*

Thanks,
Aakash.
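For context, here is a rough sketch of the sizing arithmetic I've pieced together so far. The rules of thumb (reserve one core and 1 GB per worker for the OS, cap executors at 5 cores, subtract roughly 10% memory for overhead, target ~128 MB shuffle partitions) are assumptions on my part, not something I've verified:

```python
import math

def size_executors(workers, cores_per_node, mem_gb_per_node,
                   max_cores_per_executor=5, overhead_frac=0.10):
    """Cluster-dependent chunk: --num-executors / --executor-cores / --executor-memory."""
    usable_cores = cores_per_node - 1        # leave 1 core per node for OS/daemons
    usable_mem = mem_gb_per_node - 1         # leave 1 GB per node for OS/daemons
    exec_cores = min(max_cores_per_executor, usable_cores)
    execs_per_node = usable_cores // exec_cores
    num_executors = workers * execs_per_node
    # subtract ~10% for spark.executor.memoryOverhead
    exec_mem_gb = int(usable_mem / execs_per_node * (1 - overhead_frac))
    return num_executors, exec_cores, exec_mem_gb

def shuffle_partitions(shuffle_input_gb, target_partition_mb=128):
    """Data-dependent chunk: spark.sql.shuffle.partitions from shuffle volume."""
    return max(1, math.ceil(shuffle_input_gb * 1024 / target_partition_mb))

# For the cluster in this post: 4 workers, 8 cores / 16 GB each
print(size_executors(4, 8, 16))
# For a hypothetical 12.5 GB shuffle stage
print(shuffle_partitions(12.5))
```

This reproduces numbers in the same ballpark as my current settings, but I don't know whether this is the right way to reason about it, which is why I'm asking.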
