You can use the UI to debug what is going on. For instance, are the tasks themselves taking longer, or is the job acquiring fewer executors overall? Another thing that influences this: when you run a standalone job, the measured time includes starting up all of the executors. When you launch the shell, you might be ignoring that time because you only start counting once the shell is already running (this depends on how you are measuring the time).
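One way to make the two numbers comparable is to start the clock in the job only after the SparkContext is up, so executor startup is excluded. A rough sketch (the object name and the sleep are just illustrative; the paths and master URL are taken from your output below, and executor registration is asynchronous, so the first run can still include some ramp-up):

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._   // implicits for reduceByKey

  object TimedWordCount {
    def main(args: Array[String]) {
      val sc = new SparkContext("spark://x.com:7077", "TimedWordCount")

      // Give executors a chance to register before timing starts.
      Thread.sleep(10000)

      val start = System.currentTimeMillis
      val file = sc.textFile("hdfs://x.com:9000/sandbox/data/wordcount/input")
      val counts = file.flatMap(_.split(" "))
                       .map(word => (word, 1))
                       .reduceByKey(_ + _)
      counts.saveAsTextFile("hdfs://x.com:9000/sandbox/data/wordcount/output_spark")
      val end = System.currentTimeMillis

      println("Non-cached wordcount runtime " + (end - start) / 1000 + " sec")
      sc.stop()
    }
  }

If the gap mostly disappears with this kind of timing, it was startup overhead; if it doesn't, the UI should show whether it's slower tasks or fewer executors.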
- Patrick

On Wed, Dec 18, 2013 at 11:29 PM, Debasish Das <[email protected]> wrote:
> Hi,
>
> I have the equivalent code written in a spark script and spark job.
>
> My script runs 3X faster than the job.
>
> Any idea why I am noticing this discrepancy ? Is spark shell using kryo
> serialization by default ?
>
> Spark shell: use script ./wordcount.scala
>
> SPARK_MEM=2g ./spark-shell
> scala> :load wordcount.scala
> Loading wordcount.scala...
> inputPath: String = hdfs://x.com:9000/sandbox/data/wordcount/input
> outputPath: String = hdfs://x.com:9000/sandbox/data/wordcount/output_spark
> start: Long = 1387388284050
> file: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at
> <console>:14
> words: org.apache.spark.rdd.RDD[(String, Int)] = MappedRDD[3] at map at
> <console>:17
> counts: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[6] at
> reduceByKey at <console>:18
> end: Long = 1387388301740
>
> Non-cached wordcount runtime 17 sec
>
> Spark job: use org.apache.spark.examples.HdfsWordCount
>
> [debasish@istgbd011 sag_spark]$ SPARK_MEM=2g ./run-example
> org.apache.spark.examples.HdfsWordCount spark://x.com:7077
> hdfs://x.com:9000/sandbox/data/wordcount/input
> hdfs://x.com:9000/sandbox/data/wordcount/output_spark
>
> Non-cached wordcount runtime 53 sec
>
> I like the 17 sec runtime since it is around 3X faster than exact same code
> in scalding and I have not yet utilized the caching feature.
>
> Thanks.
> Deb
