Yes, I unioned four RDDs of 32 partitions each.

Thank you,
Hussam

From: Matei Zaharia [mailto:[email protected]]
Sent: Wednesday, November 13, 2013 10:37 PM
To: [email protected]
Subject: Re: interesting finding per using union

Union just puts the data in two RDDs together, so you get an RDD containing the 
elements of both, and with the partitions that would've been in both. It's not 
a unique set union (that would be union() then distinct()). Here you've unioned 
four RDDs of 32 partitions each to get 128. If you want to have fewer 
partitions in the final RDD, but do want to include all that data together, you 
can call coalesce() after unioning them.
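
To make the partition arithmetic concrete, here is a minimal plain-Java sketch that models an RDD as a list of partitions. It is a toy model, not Spark's implementation: the class UnionPartitionModel and its helper methods are hypothetical names chosen for illustration, and the way coalesce groups partitions here is an assumption. In real Spark code, the corresponding step would be calling coalesce(32) on the unioned JavaRDD.

```java
import java.util.ArrayList;
import java.util.List;

public class UnionPartitionModel {

    // Split data into numPartitions contiguous slices (some may be empty).
    static <T> List<List<T>> parallelize(List<T> data, int numPartitions) {
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            int start = (int) ((long) i * data.size() / numPartitions);
            int end = (int) ((long) (i + 1) * data.size() / numPartitions);
            parts.add(new ArrayList<>(data.subList(start, end)));
        }
        return parts;
    }

    // Union keeps every partition from both inputs, so the counts add up.
    static <T> List<List<T>> union(List<List<T>> left, List<List<T>> right) {
        List<List<T>> parts = new ArrayList<>(left);
        parts.addAll(right);
        return parts;
    }

    // Merge neighbouring partitions until at most numPartitions remain.
    // (Assumed grouping strategy; no element is dropped or duplicated.)
    static <T> List<List<T>> coalesce(List<List<T>> parts, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        int target = Math.min(numPartitions, parts.size());
        List<List<T>> merged = new ArrayList<>();
        for (int i = 0; i < target; i++) {
            int start = (int) ((long) i * parts.size() / target);
            int end = (int) ((long) (i + 1) * parts.size() / target);
            List<T> group = new ArrayList<>();
            for (List<T> part : parts.subList(start, end)) {
                group.addAll(part);
            }
            merged.add(group);
        }
        return merged;
    }

    // Total number of elements across all partitions.
    static <T> int countElements(List<List<T>> parts) {
        int total = 0;
        for (List<T> part : parts) {
            total += part.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> finalresult = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            finalresult.add(i);
        }

        // Same shape as the loop in the question: four datasets of 32 partitions.
        List<List<Integer>> unioned = null;
        for (int i = 0; i < 4; i++) {
            List<List<Integer>> next = parallelize(finalresult, 32);
            unioned = (i == 0) ? next : union(unioned, next);
        }
        System.out.println("after union: " + unioned.size() + " partitions");

        List<List<Integer>> coalesced = coalesce(unioned, 32);
        System.out.println("after coalesce: " + coalesced.size() + " partitions");
        System.out.println("elements kept: " + countElements(coalesced));
    }
}
```

In this model the four unions produce 32 + 32 + 32 + 32 = 128 partitions, and coalescing to 32 merges them back down while keeping all 4000 elements.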

Matei

On Nov 13, 2013, at 6:33 PM, [email protected] wrote:


Hi,

I am creating an initial JavaRDD with 32 partitions, then looping over my data 
and unioning each new JavaRDD with the initial one, as follows:
JavaRDD<String> dataSetRDD = null;
JavaRDD<String> unionDataSetRDD = null;
for (..) {
    if (0 == i) {
        // First iteration: start from a 32-partition RDD.
        unionDataSetRDD = SparkDriver.getSparkContext().parallelize(finalresult, 32);
    } else {
        // Later iterations: build another 32-partition RDD and union it in.
        dataSetRDD = SparkDriver.getSparkContext().parallelize(finalresult, 32);
        unionDataSetRDD = unionDataSetRDD.union(dataSetRDD);
    }
} // for

System.out.println("unionDataSetRDD: " + unionDataSetRDD.toDebugString());

Output
unionDataSetRDD: UnionRDD[6] at union at DatasetServiceImpl.java:174 (128 partitions)
  UnionRDD[4] at union at DatasetServiceImpl.java:174 (96 partitions)
    UnionRDD[2] at union at DatasetServiceImpl.java:174 (64 partitions)
      ParallelCollectionRDD[0] at parallelize at DatasetServiceImpl.java:167 (32 partitions)
      ParallelCollectionRDD[1] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
    ParallelCollectionRDD[3] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
  ParallelCollectionRDD[5] at parallelize at DatasetServiceImpl.java:172 (32 partitions)

The interesting thing is that my final unionDataSetRDD ends up with 128 
partitions. I thought it would keep the 32 partitions I explicitly set in 
parallelize.

Does the above make sense?
Thanks,
Hussam
