Hi Grzegorz
From my understanding, for cogroup operation ( which used by
intersection), if spark.default.parallelism is not set by user, it won’t bother
to use the default value, it will use the partition number ( the max one among
all the rdds in cogroup operation) to build up a partitioner ( if non of the
rdd already has a partitioner). This is intend to avoid OOM when process a
single task.
So it explain most of your observations. TextFile generated RDD use
file split number as partition number and parallelize operation use
spark.default.parallelism as default partition number.
But this not explain your local[4] case use textfile for input and with
spark.default.parallelism set to “7” , the result for rdd2 partition count is 4
in this case? Seems to me should not happen.
Best Regards,
Raymond Liu
From: Grzegorz Białek [mailto:[email protected]]
Sent: Tuesday, August 26, 2014 7:52 PM
To: [email protected]
Subject: spark.default.parallelism bug?
Hi,
consider the following code:
import org.apache.spark.{SparkContext, SparkConf}
object ParallelismBug extends App {
var sConf = new SparkConf()
.setMaster("spark://hostName:7077") // .setMaster("local[4]")
.set("spark.default.parallelism", "7") // or without it
val sc = new SparkContext(sConf)
val rdd = sc.textFile("input/100") // val rdd = sc.parallelize(Array.range(1,
100))
val rdd2 = rdd.intersection(rdd)
println("rdd: " + rdd.partitions.size + " rdd2: " + rdd2.partitions.size)
}
Suppose that input/100 contains 100 files. In above configuration output is
rdd: 100 rdd2: 7, which seems ok. when we don't set parallelism then output is
rdd: 100 rdd2: 100, but according to
https://spark.apache.org/docs/latest/configuration.html#execution-behavior
it should be rdd: 100 rdd2: 2 (on my 1 core machine).
But when rdd is defined using sc.parallelize results seems ok: rdd: 2 rdd2: 2.
Moreover when master is local[4] and we set parallelism then result is rdd: 100
rdd2: 4 instead of rdd: 100 rdd2: 7. And when we don't set parallelism it
behaves like with master spark://hostName:7077.
Do I misunderstanding something, or is it a bug?
Thanks,
Grzegorz