When joining two VertexRDDs with identical indexes, GraphX can use a fast
code path (a zip join without any hash lookups). However, the check for
identical indexes is performed using reference equality.

Without caching, the index is computed twice, producing two copies. Although
the two copies are structurally identical, they are distinct objects and so
fail the reference-equality check. GraphX then takes the slow path, which
performs a hash lookup per joined element.
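
To illustrate the distinction, here is a minimal sketch in plain Python. It is
not GraphX code: build_index and choose_join_path are hypothetical names, and
a dict stands in for a partition's index. It only shows how a
reference-equality check rejects two indexes that are equal in content.

```python
def build_index(vertex_ids):
    """Build a vertex-id -> position map (a stand-in for a partition's index)."""
    return {vid: pos for pos, vid in enumerate(vertex_ids)}


def choose_join_path(index_a, index_b):
    """Pick the join strategy the way a reference-equality check would."""
    # `is` is true only when both names refer to the very same object.
    return "zip" if index_a is index_b else "hash"


ids = [10, 20, 30]

# Cached case: both sides share one index object, so the fast path is chosen.
cached = build_index(ids)
shared_path = choose_join_path(cached, cached)

# Uncached case: the index is built twice. The copies compare equal by
# content but are different objects, so the slow path is chosen.
copy_1 = build_index(ids)
copy_2 = build_index(ids)
recomputed_path = choose_join_path(copy_1, copy_2)
```

Here shared_path is "zip" and recomputed_path is "hash", even though
copy_1 == copy_2 holds.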

I'm working on a patch <https://github.com/apache/spark/pull/1297> that
attempts an optimistic zip join with per-element fallback to hash lookups,
which would improve this situation.
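
As a rough illustration of the idea (not the code in the pull request), the
plain-Python sketch below joins two partitions by first comparing the vertex
ids at the same position and falling back to a hash lookup only for elements
that do not line up. The function name, the list-based partition layout, and
the inner-join semantics are all assumptions made for the example.

```python
def optimistic_zip_join(ids_a, vals_a, ids_b, vals_b):
    """Inner-join two partitions on vertex id, trying the aligned slot first."""
    index_b = None  # hash index over ids_b, built lazily on the first mismatch
    joined = []
    for pos, vid in enumerate(ids_a):
        if pos < len(ids_b) and ids_b[pos] == vid:
            # Optimistic case: same id at the same position, no lookup needed.
            joined.append((vid, vals_a[pos], vals_b[pos]))
        else:
            # Fallback for this element only: look the id up by hash.
            if index_b is None:
                index_b = {v: p for p, v in enumerate(ids_b)}
            other = index_b.get(vid)
            if other is not None:
                joined.append((vid, vals_a[pos], vals_b[other]))
    return joined


# Identical layouts: every element takes the zip path.
aligned = optimistic_zip_join([1, 2, 3], ["a", "b", "c"],
                              [1, 2, 3], ["x", "y", "z"])

# Different layouts: mismatched elements fall back to hash lookups,
# and the result is the same.
shuffled = optimistic_zip_join([1, 2, 3], ["a", "b", "c"],
                               [1, 3, 2], ["x", "z", "y"])
```

When the two sides are structurally identical, no hash index is ever built, so
the cost matches a plain zip regardless of whether the indexes are the same
object.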

Ankur <http://www.ankurdave.com/>
