Hi,

Cluster: 5 nodes (1 driver and 4 workers)
Driver config: 16 cores, 32 GB RAM
Worker config: 8 cores, 16 GB RAM
I'm using the parameters below. The first chunk is cluster-dependent and the second chunk is data/code-dependent.

--num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G

--conf spark.sql.shuffle.partitions=100 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=20000

I arrived at these values through my own R&D on the properties and the issues I ran into along the way, hence the workarounds. My questions are:

*1) How can I infer, using some formula or code, the data/code-dependent chunk above?*

*2) What other properties/configurations can I use to shorten my job runtime?*

Thanks,
Aakash.
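For context, here is a rough sketch of the sizing arithmetic I've pieced together so far. The rules of thumb (reserve one core and 1 GB per worker for the OS, cap executors at 5 cores, subtract roughly 10% memory for overhead, target ~128 MB shuffle partitions) are assumptions on my part, not something I've verified:

```python
import math

def size_executors(workers, cores_per_node, mem_gb_per_node,
                   max_cores_per_executor=5, overhead_frac=0.10):
    """Cluster-dependent chunk: --num-executors / --executor-cores / --executor-memory."""
    usable_cores = cores_per_node - 1        # leave 1 core per node for OS/daemons
    usable_mem = mem_gb_per_node - 1         # leave 1 GB per node for OS/daemons
    exec_cores = min(max_cores_per_executor, usable_cores)
    execs_per_node = usable_cores // exec_cores
    num_executors = workers * execs_per_node
    # subtract ~10% for spark.executor.memoryOverhead
    exec_mem_gb = int(usable_mem / execs_per_node * (1 - overhead_frac))
    return num_executors, exec_cores, exec_mem_gb

def shuffle_partitions(shuffle_input_gb, target_partition_mb=128):
    """Data-dependent chunk: spark.sql.shuffle.partitions from shuffle volume."""
    return max(1, math.ceil(shuffle_input_gb * 1024 / target_partition_mb))

# For the cluster in this post: 4 workers, 8 cores / 16 GB each
print(size_executors(4, 8, 16))
# For a hypothetical 12.5 GB shuffle stage
print(shuffle_partitions(12.5))
```

This reproduces numbers in the same ballpark as my current settings, but I don't know whether this is the right way to reason about it, which is why I'm asking.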
