Hi,

I have equivalent code written as a Spark shell script and as a standalone Spark job.

My script runs 3X faster than the job.

Any idea why I am seeing this discrepancy? Is the Spark shell using Kryo
serialization by default?
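
For what it's worth, here is how I would pin the serializer explicitly in
both runs so the comparison does not depend on defaults. This is a minimal
sketch assuming the pre-1.0 system-property style of configuration;
KryoEnabledJob is just a placeholder name:

    // Enable Kryo explicitly; the property must be set before the
    // SparkContext is created for it to take effect.
    import org.apache.spark.SparkContext

    object KryoEnabledJob {
      def main(args: Array[String]) {
        System.setProperty("spark.serializer",
          "org.apache.spark.serializer.KryoSerializer")
        val sc = new SparkContext(args(0), "KryoEnabledJob")
        // ... same wordcount pipeline as in the runs below ...
      }
    }

    // For the shell side, the same property can be passed via SPARK_JAVA_OPTS:
    //   SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer" ./spark-shell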

Spark shell: load the script ./wordcount.scala

SPARK_MEM=2g ./spark-shell
scala> :load wordcount.scala
Loading wordcount.scala...
inputPath: String = hdfs://x.com:9000/sandbox/data/wordcount/input
outputPath: String = hdfs://x.com:9000/sandbox/data/wordcount/output_spark
start: Long = 1387388284050
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
<console>:14
words: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[3] at map at
<console>:17
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at
reduceByKey at <console>:18
end: Long = 1387388301740

Non-cached wordcount runtime 17 sec
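
For reference, the body of wordcount.scala is essentially the following,
sketched back from the REPL output above (sc is the context the shell
provides; the whitespace split is the usual tokenization):

    // wordcount.scala -- the timer brackets the whole pipeline, including the write
    val inputPath = "hdfs://x.com:9000/sandbox/data/wordcount/input"
    val outputPath = "hdfs://x.com:9000/sandbox/data/wordcount/output_spark"
    val start = System.currentTimeMillis
    val file = sc.textFile(inputPath)                             // one record per line
    val words = file.flatMap(_.split(" ")).map(word => (word, 1)) // (word, 1) pairs
    val counts = words.reduceByKey(_ + _)                         // sum counts per word
    counts.saveAsTextFile(outputPath)                             // action: forces the computation
    val end = System.currentTimeMillis
    println("Non-cached wordcount runtime " + (end - start) / 1000 + " sec")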

Spark job: run org.apache.spark.examples.HdfsWordCount

[debasish@istgbd011 sag_spark]$ SPARK_MEM=2g ./run-example \
    org.apache.spark.examples.HdfsWordCount spark://x.com:7077 \
    hdfs://x.com:9000/sandbox/data/wordcount/input \
    hdfs://x.com:9000/sandbox/data/wordcount/output_spark

Non-cached wordcount runtime 53 sec

I like the 17 sec runtime since it is around 3X faster than the exact same
code in Scalding, and I have not yet used the caching feature.
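
When I do turn caching on, it should just be one extra call before the
first action, e.g. (sketch):

    // Cache the pair RDD so repeated actions reuse the in-memory copy
    // instead of re-reading and re-tokenizing the HDFS input.
    val words = file.flatMap(_.split(" ")).map(word => (word, 1)).cache()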

Thanks.
Deb
