Hi there,

My team created a class that extends RDD, and its getPartitions method launches 
a parallelized Spark job of its own. We noticed that Spark hangs whenever we run 
a shuffle on an instance of this RDD.

I’m just wondering whether this is a valid use case and whether the Spark team 
could give us some suggestions.

We are using Spark 1.6.0 and here’s our code snippet:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

class testRDD(@transient sc: SparkContext)
  extends RDD[(String, Int)](sc, Nil)
    with Serializable {

  override def getPartitions: Array[Partition] = {
    // Nested Spark job submitted from inside getPartitions -- this is where
    // things appear to hang once a shuffle is involved downstream.
    sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _).collect()

    // Build four partition descriptors, one per index.
    val result = new Array[Partition](4)
    for (i <- 0 until 4) {
      result(i) = new Partition {
        override def index: Int = i
      }
    }
    result
  }

  // Each partition just returns a small, fixed sequence of pairs.
  override def compute(split: Partition, context: TaskContext):
    Iterator[(String, Int)] = Seq(("a", 3), ("b", 4)).iterator
}

val y = new testRDD(sc)
y.map(r => r).reduceByKey(_+_).count()
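
For comparison, here is a minimal sketch of a variant that avoids submitting a 
job from inside getPartitions, by running the sub-job on the driver before the 
RDD is constructed. The PrecomputedRDD name and the idea of passing the 
collected result in through the constructor are only illustrative assumptions, 
not something we have verified:

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Illustrative variant: the sub-job runs on the driver *before* the RDD is
// constructed, so getPartitions never submits a nested job.
class PrecomputedRDD(@transient sc: SparkContext,
                     precomputed: Array[(String, Int)])
  extends RDD[(String, Int)](sc, Nil) {

  override def getPartitions: Array[Partition] = {
    // Only builds partition descriptors; no Spark job is launched here.
    val result = new Array[Partition](4)
    for (i <- 0 until 4) {
      result(i) = new Partition {
        override def index: Int = i
      }
    }
    result
  }

  // Every partition replays the precomputed result.
  override def compute(split: Partition, context: TaskContext):
    Iterator[(String, Int)] = precomputed.iterator
}

// The sub-job runs once, up front, on the driver thread.
val precomputed = sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _).collect()
val z = new PrecomputedRDD(sc, precomputed)
z.map(r => r).reduceByKey(_ + _).count()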

Regards,
Yanyan Zhang



