I would recommend -XX:+UseParallelGC, since this is a batch job. Parallelism should be 2-3x the number of cores. Also, if those are physical machines, I would recommend an MTU of 9000 on the network. Is that 128 GB per node or 64 GB per node?
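
As a rough sketch of the 2-3x arithmetic for the 28 cores mentioned in the original message: the object and method names below are only illustrative, the property names are the ones already listed in the original mail, and the factor is a rule of thumb rather than a measured optimum. It assumes all 28 cores are actually available to the executors.

```scala
// Sketch: derive partition counts from the "2-3x the number of cores" rule of thumb.
// ParallelismSketch and its methods are hypothetical helpers, not part of Spark.
object ParallelismSketch {

  // Returns the (low, high) partition range for a given total core count.
  def partitionRange(totalCores: Int): (Int, Int) = {
    require(totalCores > 0, "totalCores must be positive")
    (2 * totalCores, 3 * totalCores)
  }

  // Builds the settings suggested in this reply for a chosen factor (2 or 3).
  def suggestedSettings(totalCores: Int, factor: Int = 2): Map[String, String] = {
    require(totalCores > 0, "totalCores must be positive")
    require(factor == 2 || factor == 3, "factor should be 2 or 3")
    val partitions = (factor * totalCores).toString
    Map(
      "spark.default.parallelism"       -> partitions,
      "spark.sql.shuffle.partitions"    -> partitions,
      "spark.executor.extraJavaOptions" -> "-XX:+UseParallelGC"
    )
  }

  def main(args: Array[String]): Unit = {
    val cores = 28 // total cores reported for the two-node cluster
    val (low, high) = partitionRange(cores)
    println(s"suggested partitions: $low to $high")
    // Print the settings in the same "key value" layout used in the original mail.
    suggestedSettings(cores).toSeq.sortBy(_._1).foreach {
      case (key, value) => println(s"$key $value")
    }
  }
}
```

With 28 cores this gives a range of 56 to 84 partitions, compared with the 24 currently configured; where in that range the job runs best would still need to be measured.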

On Thu, Apr 26, 2018, 7:40 PM vincent gromakowski <vincent.gromakow...@gmail.com> wrote:

> Ideal parallelization is 2-3x the number of cores. But it depends on the
> number of partitions of your source and the operations you use (shuffle or
> not). It can be worth paying the extra cost of an initial repartition to
> match your cluster, but it clearly depends on your DAG.
> Optimizing Spark apps depends on lots of things, so it's hard to answer:
> - cluster size
> - scheduler
> - spark version
> - transformation graph (DAG)
> ...
>
> On Thu, Apr 26, 2018 at 17:49, Pallavi Singh <pallavi_si...@persistent.com>
> wrote:
>
>> Hi Team,
>>
>> We are currently working on a POC based on Spark and Scala.
>>
>> We have to read 18 million records from a Parquet file and perform 25
>> user-defined aggregations based on grouping keys.
>>
>> We have used the Spark high-level DataFrame API for the aggregation. On a
>> cluster of two nodes we could finish the end-to-end job
>> (Read+Aggregation+Write) in 2 min.
>>
>> *Cluster Information:*
>>
>> Number of nodes: 2
>> Total cores: 28
>> Total RAM: 128 GB
>>
>> *Component:*
>>
>> Spark Core
>>
>> *Scenario:*
>>
>> How-to
>>
>> *Tuning Parameters:*
>>
>> spark.serializer org.apache.spark.serializer.KryoSerializer
>> spark.default.parallelism 24
>> spark.sql.shuffle.partitions 24
>> spark.executor.extraJavaOptions -XX:+UseG1GC
>> spark.speculation true
>> spark.executor.memory 16G
>> spark.driver.memory 8G
>> spark.sql.codegen true
>> spark.sql.inMemoryColumnarStorage.batchSize 100000
>> spark.locality.wait 1s
>> spark.ui.showConsoleProgress false
>> spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
>>
>> Please let us know if you have any ideas or tuning parameters that we can
>> use to finish the job in less than one minute.
>>
>> Regards,
>> Pallavi