On Tue, Apr 8, 2014 at 2:33 PM, Adam Novak <ano...@soe.ucsc.edu> wrote:

> What, exactly, needs to be true about the RDDs that you pass to Graph() to
> be sure of constructing a valid graph? (Do they need to have the same
> number of partitions? The same number of partitions and no empty
> partitions? Do you need to repartition them with their default partitioners
> beforehand?)
>

In theory, there should be no preconditions on the vertex and edge RDDs
passed to Graph(). They may each have any number of empty or nonempty
partitions and can be partitioned arbitrarily. GraphX will repartition the
vertex RDD as described below, and it should not repartition the edge RDD
unless you do this explicitly by calling Graph#partitionBy.
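To make that concrete, here is a minimal sketch (not from the original thread) of building a graph from RDDs with deliberately mismatched partition counts and then explicitly repartitioning the edges. It assumes an existing `SparkContext` named `sc`; the data values are made up for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy}

// Sketch only: assumes `sc` is an already-constructed SparkContext.
def buildExample(sc: SparkContext): Graph[String, Int] = {
  // Vertex and edge RDDs may use different, arbitrary partitionings.
  val vertices = sc.parallelize(
    Seq((1L, "a"), (2L, "b"), (3L, "c")), numSlices = 2)
  val edges = sc.parallelize(
    Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)), numSlices = 5)

  // GraphX repartitions the vertices internally; the edges keep
  // their partitioning...
  val graph = Graph(vertices, edges)

  // ...unless you explicitly ask for an edge partitioning strategy:
  graph.partitionBy(PartitionStrategy.EdgePartition2D)
}
```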

Unfortunately, there was a bug (SPARK-1329:
https://issues.apache.org/jira/browse/SPARK-1329) that caused an
ArrayIndexOutOfBoundsException when the edge RDD had more partitions than
the vertex RDD. I just submitted apache/spark#368
(https://github.com/apache/spark/pull/368), which should fix this. Could
you try applying it and see if that helps?

> Why does GraphImpl repartition the vertices RDD?
>

This is for two reasons:
1. To remove duplicate vertices by colocating them on the same partition:
VertexPartition.apply
(https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/VertexPartition.scala#L42).
2. To enable aggregating messages to a vertex by hashing the target vertex
ID with the vertex partitioner and sending the message to the resulting
partition: VertexRDD#aggregateUsingIndex
(https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala#L285).

Ideally we would skip repartitioning the vertex RDD if it's already
partitioned, though.
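The key property both reasons rely on is that hashing a vertex ID always selects the same partition, so every copy of a vertex (and every message addressed to it) lands in one place. A tiny standalone sketch of that idea, not the actual GraphX implementation:

```scala
// Sketch of the idea, not GraphX's actual code: a deterministic
// hash of the vertex ID picks one partition, so duplicates of a
// vertex colocate (enabling dedup) and messages can be routed to
// the partition that holds their target vertex.
object VertexRouting {
  def partitionFor(vertexId: Long, numPartitions: Int): Int = {
    // Double-mod keeps the result non-negative for negative hashes.
    (vertexId.hashCode % numPartitions + numPartitions) % numPartitions
  }
}
```

Because `partitionFor` is deterministic, calling it twice for the same ID always yields the same partition index in the range `[0, numPartitions)`.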

Ankur <http://www.ankurdave.com/>
