Hi,
I am creating an initial JavaRDD with 32 partitions, then looping over my data
and unioning each new JavaRDD with the initial one. My code is as follows:
JavaRDD<String> dataSetRDD = null;
JavaRDD<String> unionDataSetRDD = null;
for (..) { // loop index i
    if (0 == i) {
        // first iteration: create the initial RDD with 32 partitions
        unionDataSetRDD = SparkDriver.getSparkContext().parallelize(finalresult, 32);
    } else {
        // later iterations: create another 32-partition RDD and union it in
        dataSetRDD = SparkDriver.getSparkContext().parallelize(finalresult, 32);
        unionDataSetRDD = unionDataSetRDD.union(dataSetRDD);
    }
} // for
System.out.println("unionDataSetRDD: " + unionDataSetRDD.toDebugString());
Output:
unionDataSetRDD: UnionRDD[6] at union at DatasetServiceImpl.java:174 (128 partitions)
  UnionRDD[4] at union at DatasetServiceImpl.java:174 (96 partitions)
    UnionRDD[2] at union at DatasetServiceImpl.java:174 (64 partitions)
      ParallelCollectionRDD[0] at parallelize at DatasetServiceImpl.java:167 (32 partitions)
      ParallelCollectionRDD[1] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
    ParallelCollectionRDD[3] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
  ParallelCollectionRDD[5] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
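The counts in the output grow by 32 with each union (32 + 32 = 64, then 96,
then 128), as if union simply keeps every partition of both inputs. A minimal
plain-Java sketch of that arithmetic follows. It is not Spark code, and the
class and method names are hypothetical, chosen only for illustration:

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of the partition counts seen in the debug output; no Spark involved.
public class PartitionCountSketch {

    // Hypothetical helper. Assumption: each union yields the sum of the
    // partition counts of its two inputs, which matches the output above.
    public static List<Integer> countsAfterEachStep(int batches, int partitionsPerBatch) {
        List<Integer> counts = new ArrayList<>();
        int total = 0;
        for (int i = 0; i < batches; i++) {
            // The first batch starts the RDD; each later batch is unioned in.
            total += partitionsPerBatch;
            counts.add(total);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Four parallelize calls with 32 partitions each, as in the lineage above.
        System.out.println(countsAfterEachStep(4, 32)); // prints [32, 64, 96, 128]
    }
}
```

The last value, 128, matches the partition count reported for UnionRDD[6].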
The interesting thing is that my final unionDataSetRDD ends up with 128
partitions. I thought it would keep the 32 partitions that I explicitly set in
parallelize.
Does the above make sense?
Thanks,
Hussam