https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L280
On Tue, Dec 17, 2013 at 4:26 PM, Matei Zaharia <[email protected]> wrote:

> I'm not sure if a method called repartition() ever existed in an official
> release, since we don't remove methods, but there is a method called
> coalesce() that does what you want. You just tell it the desired new number
> of partitions. You can also have it shuffle the data across the cluster to
> rebalance it. Take a look at
> http://spark.incubator.apache.org/docs/latest/api/core/index.html#org.apache.spark.rdd.RDD
>
> Matei
>
> On Dec 17, 2013, at 3:53 PM, Mahdi Namazifar <[email protected]> wrote:
>
> > Hi everyone,
> >
> > I have a question regarding appending two RDDs using the union function,
> > and I would appreciate it if anyone could help me with it.
> >
> > I have two RDDs (let's call them RDD_1 and RDD_2) with the same number
> > of partitions (let's say 10), and they are defined based on the rows of
> > the same set of files that reside on HDFS. In an iterative manner I add
> > some of the elements of RDD_2 to RDD_1 by
> >
> > RDD_1.union(RDD_2.filter(x => <some filter>))
> >
> > As a result of the above, at each iteration the number of partitions of
> > RDD_1 is multiplied by 2 (20, 40, 80, 160, ...), and these new
> > partitions are generally very small in size. In Spark 0.8.0, is there
> > any way to avoid this exponential increase in the number of partitions,
> > or how can I repartition my RDD_1 to have a reasonable number of
> > partitions after the iterations? Also, is there any other way of
> > appending two RDDs that would not cause this issue?
> >
> > I noticed that in older versions of Spark a repartition function
> > existed, but it has been removed in the current version.
> >
> > Thanks,
> > Mahdi
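
The pattern Matei describes can be sketched as below — a minimal, hedged example, not code from the thread. The SparkContext, input paths, loop count, and the stand-in filter predicate are all illustrative assumptions; the point is only that union() concatenates the two RDDs' partition lists, so following each union with coalesce(n, shuffle = true) keeps the partition count from doubling every iteration:

```scala
// Sketch: capping partition growth from iterative union(), per Matei's
// suggestion. All names and paths here are hypothetical.
import org.apache.spark.SparkContext

object UnionCoalesceSketch {
  def run(sc: SparkContext): Unit = {
    val numPartitions = 10
    var rdd1 = sc.textFile("hdfs:///data/set1")   // assumed path
    val rdd2 = sc.textFile("hdfs:///data/set2")   // assumed path

    for (i <- 1 to 5) {
      val kept = rdd2.filter(x => x.nonEmpty)     // stand-in for <some filter>
      // union() alone would give 20, 40, 80, ... partitions across
      // iterations; coalesce() collapses them back to numPartitions.
      // shuffle = true redistributes data so the small union partitions
      // are rebalanced across the cluster.
      rdd1 = rdd1.union(kept).coalesce(numPartitions, shuffle = true)
    }
  }
}
```

With shuffle = false, coalesce() merges partitions without moving data between nodes, which is cheaper but can leave the result skewed; since the question mentions many very small partitions, the shuffling variant is likely the better fit here.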
