At 2014-10-28 16:27:20 +0300, Zuhair Khayyat <zuhair.khay...@gmail.com> wrote:
> I am using the connected components function of GraphX (on Spark 1.0.2) on
> some graph. However, for some reason it fails with a StackOverflowError. The
> graph is not too big; it contains 10000 vertices and 500000 edges.
>
> [...]
> 14/10/28 16:08:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage
> 13 (VertexRDD.createRoutingTables - vid2pid (aggregation)
> MapPartitionsRDD[13] at mapPartitions at VertexRDD.scala:423)
This seems like a bug in Scala's implementation of quicksort, possibly
SI-7837 [1]. GraphX uses Scala's quicksort to sort the edges when loading
them into memory. If you're able to modify Spark, you could avoid quicksort
by changing EdgePartitionBuilder.scala:40 from

    Sorting.quickSort(edgeArray)(Edge.lexicographicOrdering)

to

    implicit val ordering: Ordering[Edge[ED]] = Edge.lexicographicOrdering
    Sorting.stableSort(edgeArray)

Otherwise, the workaround will depend on the cause of the bug. If it's
because you have thousands of duplicates of the same edge, which are
triggering SI-7837, you could use RDD#distinct to remove them before
constructing the graph. If it's because of an unlikely worst-case data
distribution that's driving quicksort into its quadratic worst case (and
linear recursion depth, which overflows the stack), you could reorder the
edges using RDD#sortBy.

Ankur

[1] https://issues.scala-lang.org/browse/SI-7837

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
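[Editor's note: the stableSort replacement suggested above can be exercised
outside Spark. Below is a minimal, hypothetical sketch using a simplified
stand-in for GraphX's Edge class (the real one lives in
org.apache.spark.graphx); it is an illustration of the idea, not the actual
EdgePartitionBuilder code. Sorting.stableSort is mergesort-based, so its
recursion depth stays logarithmic regardless of duplicates or input order.]

```scala
import scala.util.Sorting

// Simplified stand-in for org.apache.spark.graphx.Edge, for illustration only.
case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

object Edge {
  // Mirrors the idea of Edge.lexicographicOrdering: sort by source id,
  // breaking ties by destination id.
  def lexicographicOrdering[ED]: Ordering[Edge[ED]] =
    Ordering.by((e: Edge[ED]) => (e.srcId, e.dstId))
}

object StableSortDemo {
  def main(args: Array[String]): Unit = {
    val edgeArray = Array(
      Edge(2L, 1L, "b"),
      Edge(1L, 3L, "a"),
      Edge(1L, 2L, "c"),
      Edge(1L, 2L, "d") // duplicate (srcId, dstId) key; stableSort keeps input order
    )
    // In-place mergesort: O(log n) stack depth even on adversarial inputs,
    // unlike quickSort, whose worst case can overflow the stack.
    implicit val ordering: Ordering[Edge[String]] = Edge.lexicographicOrdering
    Sorting.stableSort(edgeArray)
    edgeArray.foreach(println)
  }
}
```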