Hi,
consider the following code:
import org.apache.spark.{SparkConf, SparkContext}

object ParallelismBug extends App {
  val sConf = new SparkConf()
    .setMaster("spark://hostName:7077") // or .setMaster("local[4]")
    .set("spark.default.parallelism", "7") // or without this line
  val sc = new SparkContext(sConf)
  val rdd = sc.textFile("input/100") // or: sc.parallelize(Array.range(1, 100))
  val rdd2 = rdd.intersection(rdd)
  println("rdd: " + rdd.partitions.size + " rdd2: " + rdd2.partitions.size)
}
Suppose that input/100 contains 100 files. With the above configuration the
output is rdd: 100 rdd2: 7, which seems OK. When we don't set
spark.default.parallelism, the output is rdd: 100 rdd2: 100, but according to
https://spark.apache.org/docs/latest/configuration.html#execution-behavior
it should be rdd: 100 rdd2: 2 (on my 1-core machine).
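For reference, my reading of that doc section can be sketched as a plain Scala
function (just a model of the documented rule, not Spark's actual code; names
are my own):

```scala
object ShufflePartitionsModel {
  // Model of the documented behavior for shuffle operations like intersection:
  // if spark.default.parallelism is set, use it; otherwise my expectation was
  // that the cluster-manager default (e.g. 2 on a 1-core standalone) applies.
  def shufflePartitions(parentPartitions: Seq[Int],
                        defaultParallelism: Option[Int]): Int =
    defaultParallelism.getOrElse(2) // assumed fallback on a 1-core machine

  def main(args: Array[String]): Unit = {
    // parallelism set to 7, parents have 100 partitions each
    println(shufflePartitions(Seq(100, 100), Some(7))) // 7, matches rdd2: 7
    // parallelism unset: I expected 2 here, but Spark reports 100
    println(shufflePartitions(Seq(100, 100), None))    // 2
  }
}
```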
However, when rdd is defined via sc.parallelize, the results seem OK:
rdd: 2 rdd2: 2.
Moreover, when the master is local[4] and we set the parallelism, the result
is rdd: 100 rdd2: 4 instead of rdd: 100 rdd2: 7. And when we don't set the
parallelism, it behaves as with master spark://hostName:7077.
Am I misunderstanding something, or is this a bug?
Thanks,
Grzegorz