I think the latter approach is better, since it avoids unnecessary computation by filtering out the unneeded partitions. It is also better to cache the previous RDD so that it won't be computed twice.

> On May 20, 2016, at 16:59, shlomi <shlomivak...@gmail.com> wrote:
>
> Another approach I found:
>
> First, I make a PartitionsRDD class which only takes a certain range of
> partitions:
> -----------------------------------------------------
> import scala.reflect.ClassTag
> import org.apache.spark.{Partition, TaskContext}
> import org.apache.spark.rdd.RDD
>
> // Wraps a partition of the parent RDD under a new index.
> case class PartitionsRDDPartition(index: Int, origSplit: Partition) extends Partition
>
> // An RDD that exposes only a contiguous range of its parent's partitions.
> class PartitionsRDD[U: ClassTag](var prev: RDD[U], drop: Int, take: Int)
>   extends RDD[U](prev) {
>
>   override def getPartitions: Array[Partition] =
>     prev.partitions.drop(drop).take(take).zipWithIndex.map { case (split, idx) =>
>       new PartitionsRDDPartition(idx, split)
>     }.asInstanceOf[Array[Partition]]
>
>   override def compute(split: Partition, context: TaskContext): Iterator[U] =
>     prev.iterator(split.asInstanceOf[PartitionsRDDPartition].origSplit, context)
> }
> -----------------------------------------------------
>
> And then I can create my two RDDs using the following:
> -----------------------------------------------------
> def splitByPartition[T: ClassTag](rdd: RDD[T], hotPartitions: Int): (RDD[T], RDD[T]) = {
>   val left  = new PartitionsRDD[T](rdd, 0, hotPartitions)
>   val right = new PartitionsRDD[T](rdd, hotPartitions, rdd.partitions.length - hotPartitions)
>   (left, right)
> }
> -----------------------------------------------------
>
> This approach saves a few minutes compared to the one in the previous post
> (at least in a local test; I still need to try it on a real cluster).
>
> Any thoughts on this? Is this the right thing to do, or am I missing
> something important?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Splitting-RDD-by-partition-tp26983p26985.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
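
For illustration, here is a minimal sketch of how the caching suggestion at the top of this message could be combined with the splitByPartition helper from the quoted post. The input data, the partition counts, and the use of the spark-shell sc variable are assumptions made only for the example:

-----------------------------------------------------
import org.apache.spark.rdd.RDD

// Assumes a spark-shell session (sc already defined) and the PartitionsRDD /
// splitByPartition definitions from the quoted post.
val rdd: RDD[Int] = sc.parallelize(1 to 1000, numSlices = 10)

// Persist the parent so that each of its partitions is materialized at most
// once, no matter how many actions are later run on the derived RDDs.
rdd.cache()

val (hot, cold) = splitByPartition(rdd, hotPartitions = 3)

hot.count()   // computes and caches only the first 3 parent partitions
cold.count()  // computes and caches the remaining 7 parent partitions
-----------------------------------------------------

Any further actions on either half would then read the parent's partitions from the cache instead of recomputing its lineage.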