I have a Spark app that reads Avro and SequenceFile data and performs a join followed by a reduceByKey.
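For context, a pipeline of this shape might look like the following sketch. This is not the actual SparkApp code; the input paths, key/value types, and object name are assumptions. It also shows one common speed-up for a join + reduceByKey pipeline: sharing one partitioner across both operations so the join can reuse the shuffle.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical stand-in for the real reporting job (paths and types assumed).
object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-sketch"))

    // The real app reads Avro and SequenceFile inputs; plain text/SequenceFile
    // stand in here for illustration.
    val view    = sc.sequenceFile[String, String]("/path/to/view")
    val listing = sc.sequenceFile[String, String]("/path/to/listing")

    // One shared partitioner: reduceByKey combines map-side before shuffling,
    // and the subsequent join sees two co-partitioned RDDs, so it does not
    // shuffle either side again.
    val partitioner = new HashPartitioner(2000) // tune to input size

    val reducedView     = view.reduceByKey(partitioner, _ + _)
    val keyedListing    = listing.partitionBy(partitioner)

    reducedView.join(keyedListing)
      .saveAsTextFile("/path/to/output")

    sc.stop()
  }
}
```

The key point of the sketch is that `join` on two RDDs that already share a partitioner is a narrow dependency, whereas joining freshly-read RDDs shuffles both inputs in full.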
Results:

Command (identical for all runs):

  ./bin/spark-submit -v --master yarn-cluster \
    --driver-class-path /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar \
    --jars /apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.3/1.3.1.lib/spark_reporting_dep_only-1.0-SNAPSHOT-jar-with-dependencies.jar \
    --num-executors 9973 --driver-memory 14g --driver-java-options "-XX:MaxPermSize=512M" \
    --executor-memory 14g --executor-cores 1 --queue xy \
    --class com.ebay.ep.poc.spark.reporting.SparkApp \
    /home/dvasthimal/spark1.3/1.3.1.lib/spark_reporting-1.0-SNAPSHOT.jar \
    startDate=2015-04-29 endDate=2015-04-29 input=/apps/hdmi-prod/b_um/epdatasets/exptsession \
    subcommand=viewItem output=/user/dvasthimal/epdatasets/viewItem \
    buffersize=128 maxbuffersize=1068 maxResultSize=200G

Run I:
  Input:
    (View)    RDD1: 2 days =  20 files =  17,328,796,820 bytes = PARTIAL
    (Listing) RDD2:          100 files = 267,748,253,700 bytes = PARTIAL
    (SPS)     RDD3:           63 files = 142,687,611,360 bytes = FULL
  Output:
    hadoop fs -count epdatasets/viewItem
    1  101  342246603  epdatasets/viewItem
  Runtime: 26 min 36 sec

Run II:
  Input:
    (View)    RDD1: 2 days =  40 files =  34,657,593,640 bytes = PARTIAL
    (Listing) RDD2:          100 files = 267,748,253,700 bytes = PARTIAL
    (SPS)     RDD3:           63 files = 142,687,611,360 bytes = FULL
  Output:
    hadoop fs -count epdatasets/viewItem
    1  101  667790782  epdatasets/viewItem
  Runtime: 40 min 49 sec

I cannot increase memory, as 14 GB is the limit, but I can increase the number of executors and cores. Please suggest how to make this app run faster.

-- Deepak
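Since the memory per executor is capped at 14 GB but cores are available, one direction is to raise --executor-cores and set an explicit shuffle parallelism so the reduceByKey/join stages get enough tasks. A sketch of the adjusted submit command follows; the specific values (300 executors, 4 cores, 2400 partitions) are illustrative assumptions to show which knobs to turn, not tuned settings:

```shell
# Same 14g memory cap, fewer executors but more cores each, plus explicit
# parallelism for the shuffle stages. Values are illustrative only.
./bin/spark-submit -v --master yarn-cluster \
  --num-executors 300 --executor-cores 4 \
  --driver-memory 14g --executor-memory 14g \
  --conf spark.default.parallelism=2400 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --queue xy \
  --class com.ebay.ep.poc.spark.reporting.SparkApp \
  /home/dvasthimal/spark1.3/1.3.1.lib/spark_reporting-1.0-SNAPSHOT.jar \
  startDate=2015-04-29 endDate=2015-04-29 subcommand=viewItem ...
```

Multiple cores per executor share the same 14 GB JVM, so this trades per-task memory for concurrency; whether it helps depends on how memory-hungry the join tasks are.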