Hi, I have equivalent code written as a Spark shell script and as a standalone Spark job.

My script runs 3X faster than the job. Any idea why I am seeing this discrepancy? Is the Spark shell using Kryo serialization by default?

Spark shell, using the script ./wordcount.scala:

SPARK_MEM=2g ./spark-shell
scala> :load wordcount.scala
Loading wordcount.scala...
inputPath: String = hdfs://x.com:9000/sandbox/data/wordcount/input
outputPath: String = hdfs://x.com:9000/sandbox/data/wordcount/output_spark
start: Long = 1387388284050
file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:14
words: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[3] at map at <console>:17
counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at reduceByKey at <console>:18
end: Long = 1387388301740
Non-cached wordcount runtime 17 sec
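For reference, wordcount.scala is essentially the following. This is a sketch reconstructed from the :load output above; the tokenization and the saveAsTextFile call are assumptions, not a verbatim copy of the script.

// wordcount.scala -- run inside spark-shell, where sc is the SparkContext
// the shell provides
val inputPath = "hdfs://x.com:9000/sandbox/data/wordcount/input"
val outputPath = "hdfs://x.com:9000/sandbox/data/wordcount/output_spark"

val start = System.currentTimeMillis

val file = sc.textFile(inputPath)                        // RDD[String]
val words = file.flatMap(_.split(" ")).map(w => (w, 1))  // RDD[(String, Int)]; assumed tokenization
val counts = words.reduceByKey(_ + _)                    // combine counts per word
counts.saveAsTextFile(outputPath)                        // action: triggers the actual job

val end = System.currentTimeMillis
println("Non-cached wordcount runtime " + (end - start) / 1000 + " sec")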
Spark job, using org.apache.spark.examples.HdfsWordCount:

[debasish@istgbd011 sag_spark]$ SPARK_MEM=2g ./run-example org.apache.spark.examples.HdfsWordCount spark://x.com:7077 hdfs://x.com:9000/sandbox/data/wordcount/input hdfs://x.com:9000/sandbox/data/wordcount/output_spark
Non-cached wordcount runtime 53 sec

I like the 17 sec runtime, since it is around 3X faster than the exact same code in Scalding, and I have not yet utilized the caching feature.
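To take serialization out of the equation, I could pin the serializer explicitly so that both runs use Kryo. A minimal sketch, assuming Spark 0.8-style system properties (the property must be set before the SparkContext is created, e.g. through SPARK_JAVA_OPTS for the shell):

// Force Kryo in both the shell run and the job run so the serializer
// is identical; property name per the Spark tuning guide.
System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

// Equivalent setting from the environment for spark-shell / run-example:
//   SPARK_JAVA_OPTS="-Dspark.serializer=org.apache.spark.serializer.KryoSerializer"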
Thanks.
Deb