Hi there,
My team created a class that extends RDD, and inside its getPartitions method we launch a parallelized job. We noticed that Spark hangs whenever we run a shuffle on an instance of our RDD. I'm wondering whether this is a valid use case, and whether the Spark team could offer any suggestions. We are using Spark 1.6.0, and here is our code snippet:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    class testRDD(@transient sc: SparkContext)
        extends RDD[(String, Int)](sc, Nil) with Serializable {

      override def getPartitions: Array[Partition] = {
        // This nested job runs while Spark is still resolving this RDD's partitions.
        sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _).collect()
        val result = new Array[Partition](4)
        for (i <- 0 until 4) {
          result(i) = new Partition {
            override def index: Int = i // each partition needs its own index
          }
        }
        result
      }

      override def compute(split: Partition, context: TaskContext): Iterator[(String, Int)] =
        Seq(("a", 3), ("b", 4)).iterator
    }

    val y = new testRDD(sc)
    y.map(r => r).reduceByKey(_ + _).count()
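In case it helps frame the question, here is a minimal sketch of the workaround we have in mind (PrecomputedRDD is a name invented for this sketch, and it assumes the nested job's result can be computed eagerly): run the sub-job as an ordinary action on the driver before constructing the RDD, so that getPartitions itself does no distributed work:

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Sketch only: the nested job runs eagerly on the driver and its result
    // is passed in, so getPartitions never launches a job of its own.
    class PrecomputedRDD(@transient sc: SparkContext,
                         precomputed: Seq[(String, Int)])
        extends RDD[(String, Int)](sc, Nil) {

      override def getPartitions: Array[Partition] =
        Array.tabulate[Partition](4)(i => new Partition {
          override def index: Int = i
        })

      override def compute(split: Partition, context: TaskContext): Iterator[(String, Int)] =
        precomputed.iterator
    }

    // The former nested job now runs here, as a normal action on the driver:
    val sub = sc.parallelize(Seq(("a", 1), ("b", 2))).reduceByKey(_ + _).collect().toSeq
    val y = new PrecomputedRDD(sc, sub)
    y.map(r => r).reduceByKey(_ + _).count() // shuffle with no job nested in getPartitions

This avoids launching a job during partition resolution, but we would still like to know whether doing so is supported at all, or whether getPartitions is expected to be side-effect free.

Regards,
Yanyan Zhang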