At 2014-10-28 16:27:20 +0300, Zuhair Khayyat <zuhair.khay...@gmail.com> wrote:
> I am using the connected components function of GraphX (on Spark 1.0.2) on
> some graph. However, for some reason it fails with a StackOverflowError. The
> graph is not too big; it contains 10000 vertices and 500000 edges.
>
> [...]
> 14/10/28 16:08:50 INFO DAGScheduler: Submitting 1 missing tasks from Stage
> 13 (VertexRDD.createRoutingTables - vid2pid (aggregation)
> MapPartitionsRDD[13] at mapPartitions at VertexRDD.scala:423)

This seems like a bug in Scala's implementation of quicksort, maybe SI-7837 
[1]. GraphX uses Scala's quicksort to sort the edges when loading them into 
memory.

If you're able to modify Spark, you could avoid using quicksort by changing 
EdgePartitionBuilder.scala:40 from

    Sorting.quickSort(edgeArray)(Edge.lexicographicOrdering)

to

    implicit val ordering: Ordering[Edge[ED]] = Edge.lexicographicOrdering
    Sorting.stableSort(edgeArray)
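
For reference, here is what the replacement looks like as a standalone sketch 
(outside Spark, with a simplified stand-in for GraphX's Edge class -- the real 
Edge[ED] also carries an attribute field):

    import scala.util.Sorting

    object StableSortDemo {
      // Simplified stand-in for GraphX's Edge[ED], for illustration only.
      case class Edge(srcId: Long, dstId: Long)

      def main(args: Array[String]): Unit = {
        // Mimics Edge.lexicographicOrdering: sort by srcId, then dstId.
        implicit val ordering: Ordering[Edge] =
          Ordering.by(e => (e.srcId, e.dstId))

        val edges = Array(Edge(2L, 1L), Edge(1L, 3L), Edge(1L, 2L), Edge(1L, 2L))
        // stableSort is merge-sort based, so its recursion depth stays
        // logarithmic even on arrays full of duplicate keys, which is
        // exactly the case that blows the stack in SI-7837.
        Sorting.stableSort(edges)
        println(edges.toList)
      }
    }
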

Otherwise, the workaround will depend on the cause of the bug. If it's because 
you have thousands of duplicates of the same edge which are triggering SI-7837, 
you could use RDD#distinct to remove them before constructing the graph. If 
it's because of an unlikely worst-case data distribution that drives 
quicksort's recursion to linear depth (and its running time to quadratic), you 
could reorder the edges using RDD#sortBy.
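
The deduplication workaround, sketched on a plain Scala collection since a 
full Spark job won't fit in a snippet (RDD#distinct does the same thing, 
distributed, on an RDD of edge tuples before Graph.fromEdgeTuples):

    object DistinctEdgesDemo {
      def main(args: Array[String]): Unit = {
        // An edge list with heavy duplication -- the shape of input that
        // trips SI-7837. With Spark you would call .distinct() on the
        // RDD[(Long, Long)] before building the Graph.
        val edges = Seq.fill(1000)((1L, 2L)) ++ Seq((2L, 3L), (3L, 1L))
        val deduped = edges.distinct
        println(deduped)  // only the three unique edges survive
      }
    }
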

Ankur

[1] https://issues.scala-lang.org/browse/SI-7837

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org