Hello List,

I was wondering what the design principle is behind an RDD inheriting its number of partitions from its parent. See the simple example below [*]. 'ngauss_rdd2' holds significantly less data than its parent; intuitively, shouldn't Spark invoke coalesce automatically in such cases for performance? Is there a configuration option for this?
Best,
-m

[*]
// Generate 1 million Gaussian random numbers
import util.Random
Random.setSeed(4242)

val ngauss = (1 to 1e6.toInt).map(x => Random.nextGaussian)
val ngauss_rdd = sc.parallelize(ngauss)
ngauss_rdd.count            // 1 million
ngauss_rdd.partitions.size  // 4

val ngauss_rdd2 = ngauss_rdd.filter(x => x > 4.0)
ngauss_rdd2.count            // 35
ngauss_rdd2.partitions.size  // 4