Hi all - I'm running the Spark LDA algorithm on a dataset of roughly 3 million terms, with a resulting RDD of approximately 20 GB, on a 5-node cluster with 10 executors (3 cores each) and 14 GB of memory per executor.
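For context, the LDA call itself is nothing exotic. A rough sketch of what it looks like is below (this assumes the stock MLlib LDA API; the k value, iteration count, and input path are placeholders rather than my real values):

import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// corpus of (docId, term-count vector) pairs; roughly 20 GB once materialized
val corpus: RDD[(Long, Vector)] = sc.objectFile[(Long, Vector)]("hdfs:///path/to/corpus")
corpus.cache()

// k and maxIterations are placeholders; checkpointInterval is the MLlib default of 10
// (checkpointing only takes effect if sc.setCheckpointDir(...) has been set)
val ldaModel = new LDA()
  .setK(100)
  .setMaxIterations(50)
  .setCheckpointInterval(10)
  .run(corpus)
  .asInstanceOf[DistributedLDAModel]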
As the application runs, I'm seeing progressively longer execution times for the mapPartitions stage (18 s, then 56 s, then 3.4 min), caused by progressively longer shuffle read times. Is there any way to tune this behavior out? My configs are below.

screen spark-shell --driver-memory 15g --num-executors 10 --executor-cores 3 \
  --conf "spark.executor.memory=14g" \
  --conf "spark.io.compression.codec=lz4" \
  --conf "spark.shuffle.consolidateFiles=true" \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.shuffle.manager=tungsten-sort" \
  --conf "spark.akka.frameSize=1028" \
  --conf "spark.executor.extraJavaOptions=-Xss256k -XX:MaxPermSize=128m -XX:PermSize=96m -XX:MaxTenuringThreshold=2 -XX:SurvivorRatio=6 -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AggressiveOpts -XX:+UseCompressedOops" \
  --master yarn-client

-Ilya Ganelin