Actually, GraphX doesn't need to scan all the edges, because it
maintains a clustered index on the source vertex id (that is, it sorts
the edges by source vertex id and stores the offsets in a hash table).
If the activeDirection is appropriately set, it can then jump only to
the clusters with active source vertices.

See the EdgePartition#index field [1], which stores the offsets, and
the logic in GraphImpl#aggregateMessagesWithActiveSet [2], which
decides whether to do a full scan or use the index.

[1] 
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/EdgePartition.scala#L60
[2]
https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L237-266
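
From user code, this shows up as the activeDirection argument to Pregel
(aggregateMessagesWithActiveSet itself is private to graphx). For reference,
here is roughly the standard single-source shortest paths example from the
GraphX programming guide, with activeDirection = EdgeDirection.Out so that in
later rounds only edges whose source vertex received a message are visited;
the object and method names are just illustrative.

import org.apache.spark.graphx._

// Sketch: single-source shortest paths via the Pregel API, with
// activeDirection = EdgeDirection.Out so that sendMsg only runs on edges
// whose source vertex received a message in the previous round.
object ActiveDirectionExample {
  def sssp(graph: Graph[Double, Double], sourceId: VertexId): Graph[Double, Double] = {
    // Start with distance 0 at the source and infinity everywhere else.
    val initial = graph.mapVertices { (id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity
    }
    Pregel(initial, Double.PositiveInfinity,
           activeDirection = EdgeDirection.Out)(
      vprog = (_, dist, newDist) => math.min(dist, newDist),
      sendMsg = triplet =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else
          Iterator.empty,
      mergeMsg = (a, b) => math.min(a, b)
    )
  }
}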

Ankur


On Thu, Apr 9, 2015 at 3:21 AM, James <alcaid1...@gmail.com> wrote:
> In aggregateMessagesWithActiveSet, Spark still has to read all edges. That
> means a fixed cost that scales with the graph size is unavoidable in a
> Pregel-like iteration.
>
> But what if I have to run nearly 100 iterations, and in the last 50
> iterations fewer than 0.1% of the vertices need to be updated? That fixed
> cost makes the program finish in an unacceptable amount of time.
