Hi everyone, I have a question about appending two RDDs with the union function, and I would appreciate it if anyone could help me with it.
I have two RDDs (call them RDD_1 and RDD_2) with the same number of partitions (say 10). Both are defined over the rows of the same set of files on HDFS. On each iteration I add some of the elements of RDD_2 to RDD_1 with

    RDD_1.union(RDD_2.filter(x => <some filter>))

As a result, the number of partitions of RDD_1 doubles at each iteration (20, 40, 80, 160, ...), and the new partitions are generally very small.

In Spark 0.8.0, is there any way to avoid this exponential increase in the number of partitions? If not, how can I repartition RDD_1 to a reasonable number of partitions after the iterations? Also, is there another way of appending two RDDs that does not cause this issue? I noticed that older versions of Spark had a repartition function that has been removed in the current version.

Thanks,
Mahdi
