union() simply concatenates two RDDs: the result contains every element of both 
inputs and every partition of both inputs. It is not a set union, so duplicates 
are kept (a set union would be union() followed by distinct()). Here you’ve 
unioned four RDDs of 32 partitions each, which gives 128. If you want fewer 
partitions in the final RDD but still want all the data together, you can call 
coalesce() on it after the unions.
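
The partition arithmetic can be sketched without Spark at all. The following is 
a toy model in plain Java, not Spark's implementation: the class 
PartitionSketch and its helper methods are hypothetical names, an "RDD" is just 
a list of partitions, and the way elements are assigned to partitions is 
arbitrary. It only illustrates that a union keeps the partitions of both inputs 
and that a coalesce-style step merges them back down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: an "RDD" here is a plain list of partitions.
// None of this is Spark code; all names are hypothetical.
class PartitionSketch {

    // Spread the data over numPartitions partitions (round-robin, chosen
    // arbitrarily for this sketch; Spark slices collections differently).
    static List<List<String>> parallelize(List<String> data, int numPartitions) {
        List<List<String>> parts = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < data.size(); i++) {
            parts.get(i % numPartitions).add(data.get(i));
        }
        return parts;
    }

    // Union keeps every partition of both inputs, so the counts add up.
    // Duplicates are not removed.
    static List<List<String>> union(List<List<String>> a, List<List<String>> b) {
        List<List<String>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Merge existing partitions into at most `target` partitions.
    // The grouping rule is an assumption of this sketch.
    static List<List<String>> coalesce(List<List<String>> parts, int target) {
        if (target >= parts.size()) {
            return parts;
        }
        List<List<String>> out = new ArrayList<>();
        for (int p = 0; p < target; p++) {
            out.add(new ArrayList<>());
        }
        for (int i = 0; i < parts.size(); i++) {
            out.get(i % target).addAll(parts.get(i));
        }
        return out;
    }

    // Total number of elements across all partitions.
    static int countElements(List<List<String>> parts) {
        int n = 0;
        for (List<String> p : parts) {
            n += p.size();
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        List<List<String>> unioned = null;
        // Same shape as the loop in the question: four RDDs of 32 partitions.
        for (int i = 0; i < 4; i++) {
            List<List<String>> next = parallelize(data, 32);
            unioned = (i == 0) ? next : union(unioned, next);
        }
        System.out.println(unioned.size());               // prints 128
        System.out.println(coalesce(unioned, 32).size()); // prints 32
        System.out.println(countElements(unioned));       // prints 12
    }
}
```

In the loop from the question, the corresponding step would presumably be a 
single unionDataSetRDD.coalesce(32) once the loop has finished, rather than one 
call per iteration.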

Matei

On Nov 13, 2013, at 6:33 PM, hussam_jar...@dell.com wrote:

> Hi,
>  
> I am creating an initial JavaRDD with 32 partitions, then looping over my 
> data and unioning each new RDD with the initial one, as follows:
> JavaRDD<String> dataSetRDD = null;
> JavaRDD<String> unionDataSetRDD = null;
> for (..) {
>     if (0 == i) {
>         unionDataSetRDD = SparkDriver.getSparkContext().parallelize(finalresult, 32);
>     } else {
>         dataSetRDD = SparkDriver.getSparkContext().parallelize(finalresult, 32);
>         unionDataSetRDD = unionDataSetRDD.union(dataSetRDD);
>     }
> } // for
>  
> System.out.println("unionDataSetRDD: " + unionDataSetRDD.toDebugString());
>  
> Output:
> unionDataSetRDD: UnionRDD[6] at union at DatasetServiceImpl.java:174 (128 partitions)
>   UnionRDD[4] at union at DatasetServiceImpl.java:174 (96 partitions)
>     UnionRDD[2] at union at DatasetServiceImpl.java:174 (64 partitions)
>       ParallelCollectionRDD[0] at parallelize at DatasetServiceImpl.java:167 (32 partitions)
>       ParallelCollectionRDD[1] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
>     ParallelCollectionRDD[3] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
>   ParallelCollectionRDD[5] at parallelize at DatasetServiceImpl.java:172 (32 partitions)
>  
> The interesting part is that my final unionDataSetRDD ends up with 128 
> partitions. I thought it would keep the 32 partitions that I explicitly set 
> in parallelize.
>  
> Does the above make sense?
> Thanks,
> Hussam
