I'm using PySpark, and I'm wondering how to change the number of partitions used for the result tasks (the reduce side, in my case). I'm running Spark on a cluster of two machines, each with 16 cores.
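For context, the job boils down to a key-value reduce. The snippet below is a stripped-down sketch of that pattern rather than my actual code (the app name, data, and reducer are just illustrative):

```python
from pyspark import SparkContext

# Master/deploy settings come from spark-submit on my cluster.
sc = SparkContext(appName="reduce-partition-count")

# Illustrative (key, value) pairs; the real input is much larger.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# The reduce step whose result stage appears in the log below. I am not
# passing an explicit partition count anywhere, yet the shuffled result
# ends up with only 2 partitions.
counts = pairs.reduceByKey(lambda x, y: x + y)
print(counts.collect())
```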
Here's the relevant log output for my result stage:

```
13/11/17 23:16:47 INFO SparkContext: time: 18851958895218046
*13/11/17 23:16:47 INFO SparkContext: partition length: 2*
13/11/17 23:16:47 DEBUG DAGScheduler: Got event of type org.apache.spark.scheduler.JobSubmitted
13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class org.apache.spark.api.python.PythonRDD PythonRDD[6] at RDD at PythonRDD.scala:34
13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1
13/11/17 23:16:47 INFO DAGScheduler: class of dep: class org.apache.spark.rdd.MappedRDD MappedRDD[5] at values at NativeMethodAccessorImpl.java:-2
13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class org.apache.spark.rdd.MappedRDD MappedRDD[5] at values at NativeMethodAccessorImpl.java:-2
13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1
13/11/17 23:16:47 INFO DAGScheduler: class of dep: class org.apache.spark.rdd.ShuffledRDD ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2
13/11/17 23:16:47 INFO DAGScheduler: class of RDD: class org.apache.spark.rdd.ShuffledRDD ShuffledRDD[4] at partitionBy at NativeMethodAccessorImpl.java:-2
13/11/17 23:16:47 INFO DAGScheduler: number of dependencies: 1
```

In this case, Spark seems to set the number of partitions for the result stage to 2 automatically (note the "partition length: 2" line), so only two reduce tasks run, one on each machine. Is there a way to modify this number? More generally, how do you configure the number of reduce tasks?

Thanks!
Umar