Repartitioning an RDD

Mahdi Namazifar Tue, 17 Dec 2013 15:54:12 -0800

Hi everyone,

I have a question about appending two RDDs with the union function,
and I would appreciate it if anyone could help me with it.


I have two RDDs (let's call them RDD_1 and RDD_2) with the same number of
partitions (let's say 10) and they are defined based on the rows of the
same set of files that reside on HDFS.  In an iterative manner I add some
of the elements of RDD_2 to RDD_1 by

RDD_1.union(RDD_2.filter(x => <some filter>))

As a result, at each iteration the number of partitions of RDD_1 is
multiplied by 2 (20, 40, 80, 160, ...), and the new partitions are
generally very small.  In Spark 0.8.0, is there any way to avoid this
exponential increase in the number of partitions?  If not, how can I
repartition RDD_1 to a reasonable number of partitions after the
iterations?  Also, is there any other way of appending two RDDs that
would not cause this issue?
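
To make the pattern concrete, here is a minimal sketch of the kind of loop
I mean, together with one thing that might be worth trying.  It is only an
illustration under assumptions: the local SparkContext, the made-up integer
data (standing in for the HDFS files), the placeholder filter, and the
object name are all hypothetical, and it assumes that coalesce is available
on RDDs in the Spark version in use.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object UnionPartitionsSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local context; the real job reads its input from HDFS.
    val sc = new SparkContext("local", "union-partitions-sketch")

    // Stand-ins for RDD_1 and RDD_2, each created with 10 partitions.
    var rdd1: RDD[Int] = sc.parallelize(1 to 1000, 10)
    val rdd2: RDD[Int] = sc.parallelize(1001 to 2000, 10)

    for (i <- 1 to 3) {
      // Placeholder predicate standing in for <some filter>.
      val extra = rdd2.filter(x => x % (i + 1) == 0)

      // union keeps the partitions of both inputs, so the count grows.
      val grown = rdd1.union(extra)

      // Assumption: coalesce merges partitions down to the requested
      // number (without a shuffle by default).
      rdd1 = grown.coalesce(10)

      println("iteration " + i + ": " + grown.partitions.size +
        " partitions after union, " + rdd1.partitions.size +
        " after coalesce")
    }

    sc.stop()
  }
}
```

Coalescing inside the loop keeps the partition count bounded at every
iteration; coalescing once after the loop would be the other obvious
variant, at the cost of carrying many small partitions in between.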

I noticed that older versions of Spark had a repartition function, which
appears to have been removed in the current version.

Thanks,
Mahdi
