I think the latter approach is better, since it avoids unnecessary computation by filtering out the unneeded partitions. It is also better to cache the previous RDD so that it won't be computed twice.

> On May 20, 2016, at 16:59, shlomi <shlomivak...@gmail.com> wrote:
>
> Another approach I found:
>
> First, I make a PartitionsRDD class which only takes a certain range of
> partitions:
> -----------------------------------------------------
> import scala.reflect.ClassTag
> import org.apache.spark.{Partition, TaskContext}
> import org.apache.spark.rdd.RDD
>
> // Wraps a partition of the parent RDD under a new index.
> case class PartitionsRDDPartition(index: Int, origSplit: Partition) extends Partition
>
> // An RDD that exposes only a contiguous range of its parent's partitions.
> class PartitionsRDD[U: ClassTag](var prev: RDD[U], drop: Int, take: Int)
>   extends RDD[U](prev) {
>
>   override def getPartitions: Array[Partition] =
>     prev.partitions.drop(drop).take(take).zipWithIndex.map { case (split, idx) =>
>       new PartitionsRDDPartition(idx, split)
>     }.asInstanceOf[Array[Partition]]
>
>   override def compute(split: Partition, context: TaskContext): Iterator[U] =
>     prev.iterator(split.asInstanceOf[PartitionsRDDPartition].origSplit, context)
> }
> -----------------------------------------------------
>
> And then I can create my two RDDs using the following:
> -----------------------------------------------------
> def splitByPartition[T: ClassTag](rdd: RDD[T], hotPartitions: Int): (RDD[T], RDD[T]) = {
>   val left  = new PartitionsRDD[T](rdd, 0, hotPartitions)
>   val right = new PartitionsRDD[T](rdd, hotPartitions, rdd.partitions.length - hotPartitions)
>   (left, right)
> }
> -----------------------------------------------------
>
> This approach saves a few minutes compared to the one in the previous post
> (at least in a local test; I still need to try it on a real cluster).
>
> Any thoughts on this? Is this the right thing to do, or am I missing
> something important?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-by-partition-tp26983p26985.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
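
For illustration, here is a minimal sketch of how the caching suggestion at the top of this message could be combined with the splitByPartition helper from the quoted post. The input data, the partition counts, and the use of the spark-shell sc variable are assumptions made only for the example:

-----------------------------------------------------
import org.apache.spark.rdd.RDD

// Assumes a spark-shell session (sc already defined) and the PartitionsRDD /
// splitByPartition definitions from the quoted post.
val rdd: RDD[Int] = sc.parallelize(1 to 1000, numSlices = 10)

// Persist the parent so that each of its partitions is materialized at most
// once, no matter how many actions are later run on the derived RDDs.
rdd.cache()

val (hot, cold) = splitByPartition(rdd, hotPartitions = 3)

hot.count()   // computes and caches only the first 3 parent partitions
cold.count()  // computes and caches the remaining 7 parent partitions
-----------------------------------------------------

Any further actions on either half would then read the parent's partitions from the cache instead of recomputing its lineage.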