On Tue, Apr 8, 2014 at 2:33 PM, Adam Novak <ano...@soe.ucsc.edu> wrote:
> What, exactly, needs to be true about the RDDs that you pass to Graph() to
> be sure of constructing a valid graph? (Do they need to have the same
> number of partitions? The same number of partitions and no empty
> partitions? Do you need to repartition them with their default partitioners
> beforehand?)

In theory, there should be no preconditions on the vertex and edge RDDs passed to Graph(). They may each have any number of empty or nonempty partitions and can be partitioned arbitrarily. GraphX will repartition the vertex RDD as described below, and it will not repartition the edge RDD unless you do so explicitly by calling Graph#partitionBy.

Unfortunately, there was a bug (SPARK-1329 <https://issues.apache.org/jira/browse/SPARK-1329>) that caused an ArrayIndexOutOfBoundsException when the edge RDD had more partitions than the vertex RDD. I just submitted apache/spark#368 <https://github.com/apache/spark/pull/368>, which should fix this. Could you try applying it to see if it helps?

Why does GraphImpl repartition the vertex RDD? There are two reasons:

1. To remove duplicate vertices by colocating them on the same partition: VertexPartition.apply <https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/VertexPartition.scala#L42>.

2. To enable aggregating messages to a vertex by hashing the target vertex ID using the vertex partitioner and sending the message to the resulting partition: VertexRDD#aggregateUsingIndex <https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L285>.

Ideally, though, we would skip repartitioning the vertex RDD if it is already correctly partitioned.

Ankur <http://www.ankurdave.com/>
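To make the two reasons concrete, here is a minimal sketch in plain Scala (collections standing in for RDD partitions, not the actual Spark/GraphX API; all names are hypothetical) of what hash-partitioning the vertices buys you: duplicates of a vertex ID always land in the same partition, so deduplication is a local operation, and a message can be routed to a vertex by applying the same hash to its target ID.

```scala
// Hypothetical sketch: plain Scala collections in place of RDD partitions.
object VertexRoutingSketch {
  val numPartitions = 4

  // The "vertex partitioner": map a 64-bit vertex ID to a partition index.
  // (GraphX uses a hash partitioner; plain modulo is enough for the sketch.)
  def partitionOf(vid: Long): Int =
    (((vid % numPartitions) + numPartitions) % numPartitions).toInt

  // Reason 1: colocate duplicates. Every copy of a vertex ID hashes to the
  // same partition, so a per-partition map can drop duplicates locally,
  // without any cross-partition communication.
  def dedup(vertices: Seq[(Long, String)]): Map[Int, Map[Long, String]] =
    vertices.groupBy(v => partitionOf(v._1)).map { case (p, vs) =>
      p -> vs.toMap // later entries win, as a stand-in for merging duplicates
    }

  // Reason 2: aggregate messages. Each message is sent to the partition
  // owning its target ID, then summed per target within that partition.
  def aggregate(msgs: Seq[(Long, Int)]): Map[Int, Map[Long, Int]] =
    msgs.groupBy(m => partitionOf(m._1)).map { case (p, ms) =>
      p -> ms.groupBy(_._1).map { case (vid, xs) => vid -> xs.map(_._2).sum }
    }
}
```

If the vertex RDD were not partitioned this way, both operations would need a shuffle keyed on the vertex ID anyway, which is exactly the repartitioning GraphImpl performs up front.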