https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L280
On Tue, Dec 17, 2013 at 4:26 PM, Matei Zaharia <[email protected]> wrote:

> I'm not sure if a method called repartition() ever existed in an official
> release, since we don't remove methods, but there is a method called
> coalesce() that does what you want. You just tell it the desired new number
> of partitions. You can also have it shuffle the data across the cluster to
> rebalance it. Take a look at
> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
>
> Matei
>
> On Dec 17, 2013, at 3:53 PM, Mahdi Namazifar <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I have a question regarding appending two RDDs using the union function,
> > and I would appreciate it if anyone could help me with it.
> >
> > I have two RDDs (let's call them RDD_1 and RDD_2) with the same number
> > of partitions (let's say 10), and they are defined based on the rows of
> > the same set of files that reside on HDFS. In an iterative manner I add
> > some of the elements of RDD_2 to RDD_1 by
> >
> > RDD_1.union(RDD_2.filter(x => <some filter>))
> >
> > As a result of the above, at each iteration the number of partitions of
> > RDD_1 is multiplied by 2 (20, 40, 80, 160, ...), and these new
> > partitions are generally very small in size. In Spark 0.8.0, is there
> > any way to avoid this exponential increase in the number of partitions,
> > or how can I repartition my RDD_1 to have a reasonable number of
> > partitions after the iterations? Also, is there any other way of
> > appending two RDDs that would not cause this issue?
> >
> > I noticed that in older versions of Spark a repartition function
> > existed, but it has been removed in the current version.
> >
> > Thanks,
> > Mahdi
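
The pattern Matei describes can be sketched as below — a minimal, hedged example, not code from the thread. The SparkContext, input paths, loop count, and the stand-in filter predicate are all illustrative assumptions; the point is only that union() concatenates the two RDDs' partition lists, so following each union with coalesce(n, shuffle = true) keeps the partition count from doubling every iteration:

```scala
// Sketch: capping partition growth from iterative union(), per Matei's
// suggestion. All names and paths here are hypothetical.
import org.apache.spark.SparkContext

object UnionCoalesceSketch {
  def run(sc: SparkContext): Unit = {
    val numPartitions = 10
    var rdd1 = sc.textFile("hdfs:///data/set1")   // assumed path
    val rdd2 = sc.textFile("hdfs:///data/set2")   // assumed path

    for (i <- 1 to 5) {
      val kept = rdd2.filter(x => x.nonEmpty)     // stand-in for <some filter>
      // union() alone would give 20, 40, 80, ... partitions across
      // iterations; coalesce() collapses them back to numPartitions.
      // shuffle = true redistributes data so the small union partitions
      // are rebalanced across the cluster.
      rdd1 = rdd1.union(kept).coalesce(numPartitions, shuffle = true)
    }
  }
}
```

With shuffle = false, coalesce() merges partitions without moving data between nodes, which is cheaper but can leave the result skewed; since the question mentions many very small partitions, the shuffling variant is likely the better fit here.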
