I’m not sure whether a method called repartition() ever existed in an official 
release, since we don’t remove methods, but there is a method called coalesce() 
that does what you want: you just tell it the desired new number of partitions, 
and you can also have it shuffle the data across the cluster to rebalance it. 
Take a look at 
http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD.
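For example, a minimal sketch in Scala (the RDD names, the filter, and the 
target of 10 partitions are just placeholders):

    // union() concatenates the parents' partition lists, so the result has
    // rdd1.partitions.size + rdd2.partitions.size partitions
    val combined = rdd1.union(rdd2.filter(x => x.nonEmpty))
    // shuffle = true redistributes the rows evenly across the 10 new partitions;
    // without it, coalesce only merges existing partitions locally
    val rebalanced = combined.coalesce(10, shuffle = true)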

Matei

On Dec 17, 2013, at 3:53 PM, Mahdi Namazifar <mahdi.namazi...@gmail.com> wrote:

> Hi everyone,
> 
> I have a question regarding appending two RDDs using the union function, and 
> I would appreciate it if anyone could help me with it.
> 
> I have two RDDs (let's call them RDD_1 and RDD_2) with the same number of 
> partitions (let's say 10) and they are defined based on the rows of the same 
> set of files that reside on HDFS.  In an iterative manner I add some of the 
> elements of RDD_2 to RDD_1 by
> 
> RDD_1.union(RDD_2.filter(x => <some filter>))
> 
> As a result of the above, at each iteration the number of partitions of RDD_1 
> is multiplied by 2 (20, 40, 80, 160, ...) and these new partitions are 
> generally very small.  In Spark 0.8.0, is there any way to avoid this 
> exponential increase in the number of partitions, or how can I repartition 
> RDD_1 to have a reasonable number of partitions after the iterations?  Also, 
> is there any other way of appending two RDDs that would not cause this issue?
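> A minimal sketch of the loop I mean (the paths, the filter, and the 
> iteration count are just placeholders):
> 
>     var rdd1 = sc.textFile("hdfs:///data/set1")  // assume 10 partitions
>     val rdd2 = sc.textFile("hdfs:///data/set2")  // assume 10 partitions
>     for (i <- 1 to 4) {
>       rdd1 = rdd1.union(rdd2.filter(x => x.contains("keep")))
>       println(rdd1.partitions.size)  // partition count grows every iteration
>     }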
> 
> I noticed that in older versions of Spark a repartition function existed 
> that has been removed in the current version.
> 
> Thanks,
> Mahdi
