I have a Spark app that reads Avro and SequenceFile data and performs a join
followed by reduceByKey.
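
(Not in the original mail: a minimal Scala sketch of the kind of job described
above, assuming one Avro input and one SequenceFile input joined on a common
key. The paths, field names ("itemId", "userId"), and value types are
placeholders, not the actual SparkApp code.)

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable
import org.apache.spark.{SparkConf, SparkContext}

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("JoinSketch"))

    // Avro input (e.g. the view dataset), keyed by a placeholder join field.
    val views = sc.newAPIHadoopFile(
        "/path/to/views",
        classOf[AvroKeyInputFormat[GenericRecord]],
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable])
      .map { case (k, _) =>
        val rec = k.datum()
        // keep only plain strings so nothing non-serializable crosses the shuffle
        (rec.get("itemId").toString, rec.get("userId").toString)
      }

    // SequenceFile input (e.g. the listing dataset), already (key, value) pairs.
    val listings = sc.sequenceFile[String, String]("/path/to/listings")

    // Join on the common key, then count joined records per key with reduceByKey.
    val counts = views.join(listings)
      .map { case (itemId, _) => (itemId, 1L) }
      .reduceByKey(_ + _)

    counts.saveAsTextFile("/path/to/output")
    sc.stop()
  }
}

Note that join and reduceByKey also accept an explicit numPartitions argument,
which, together with --num-executors and --executor-cores, controls how much
parallelism the shuffle stages get.
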

Results:
Command for all runs:
./bin/spark-submit -v --master yarn-cluster --driver-class-path
/apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-EBAY-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar
--jars
/apache/hadoop-2.4.1-2.1.3.0-2-EBAY/share/hadoop/hdfs/hadoop-hdfs-2.4.1-EBAY-2.jar,/home/dvasthimal/spark1.3/1.3.1.lib/spark_reporting_dep_only-1.0-SNAPSHOT-jar-with-dependencies.jar
--num-executors 9973 --driver-memory 14g --driver-java-options
"-XX:MaxPermSize=512M" --executor-memory 14g --executor-cores 1 --queue xy
--class com.ebay.ep.poc.spark.reporting.SparkApp
/home/dvasthimal/spark1.3/1.3.1.lib/spark_reporting-1.0-SNAPSHOT.jar
startDate=2015-04-29 endDate=2015-04-29
input=/apps/hdmi-prod/b_um/epdatasets/exptsession subcommand=viewItem
output=/user/dvasthimal/epdatasets/viewItem buffersize=128
maxbuffersize=1068 maxResultSize=200G

I)
Input:
(View) RDD1: 2 Days = 20 Files = 17,328,796,820 bytes = PARTIAL
(Listing) RDD2: 100 Files = 267,748,253,700 bytes  = PARTIAL
(SPS) RDD3: 63 Files = 142,687,611,360 bytes = FULL

Output:
hadoop fs -count epdatasets/viewItem
           1          101          342246603 epdatasets/viewItem
Runtime: 26mins, 36sec


II)
Input:
(View) RDD1: 2 Days = 40 Files = 34,657,593,640 bytes = PARTIAL
(Listing) RDD2: 100 Files = 267,748,253,700 bytes  = PARTIAL
(SPS) RDD3: 63 Files = 142,687,611,360 bytes = FULL

Output:
hadoop fs -count epdatasets/viewItem
           1          101          667790782 epdatasets/viewItem
Runtime: 40mins, 49sec


I cannot increase memory, as 14G is the limit.
I can increase the number of executors and cores. Please suggest how to make
this app run faster.
-- 
Deepak
