When joining two VertexRDDs with identical indexes, GraphX can use a fast code path (a zip join without any hash lookups). However, the check for identical indexes is performed using reference equality.
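
To make the distinction concrete, here is a minimal toy sketch of a join that picks its path by reference equality. The names ToyIndex, ToyPartition and JoinSketch are hypothetical and are not GraphX's internal classes; the sketch only assumes that an index maps vertex ids to positions in a value array.

```scala
// Toy model only: these names are hypothetical, not GraphX internals.
final case class ToyIndex(ids: Vector[Long]) {
  // Position of each vertex id, used only by the hash-lookup path.
  lazy val position: Map[Long, Int] = ids.zipWithIndex.toMap
}

final case class ToyPartition(index: ToyIndex, values: Vector[Int])

object JoinSketch {
  // Returns the path taken plus the joined (id, left, right) triples.
  def join(a: ToyPartition, b: ToyPartition): (String, Vector[(Long, Int, Int)]) =
    if (a.index eq b.index) {
      // Same index object: values line up by position, so zip them directly.
      val out = a.index.ids.indices.toVector.map { i =>
        (a.index.ids(i), a.values(i), b.values(i))
      }
      ("zip", out)
    } else {
      // Different index objects: look up each id of `a` in `b`'s index.
      val out = a.index.ids.zip(a.values).flatMap { case (id, av) =>
        b.index.position.get(id).map(i => (id, av, b.values(i)))
      }
      ("hash", out)
    }
}
```

In this sketch, two ToyIndex values built separately from the same ids compare equal with == but not with eq, so joining partitions that hold them takes the "hash" branch even though the "zip" branch would have produced the same result.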
Without caching, two copies of the index are created. Although the two indexes are structurally identical, they fail the reference-equality check, so GraphX falls back to the slow path, which performs a hash lookup per joined element.

I'm working on a patch <https://github.com/apache/spark/pull/1297> that attempts an optimistic zip join with per-element fallback to hash lookups, which would improve this situation.

Ankur <http://www.ankurdave.com/>