These are excellent questions. Answers below:

On Tue, Apr 22, 2014 at 8:20 AM, wu zeming <zemin...@gmail.com> wrote:

> 1. Why do some transformations, like partitionBy and mapVertices, cache the
> new graph, while others, like outerJoinVertices, do not?


In general, we cache RDDs that are used more than once to avoid
recomputation. In mapVertices, the newVerts RDD is used to construct both
changedVerts and the new graph, so we have to cache it. Good catch that
outerJoinVertices doesn't follow this convention! To be safe, we should be
caching newVerts there as well. Also, partitionBy doesn't need to cache
newEdges, so we should remove the call to cache().
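
For illustration, here is a minimal plain-RDD sketch of that convention. The
names are made up and this is not the actual mapVertices code; it just shows
why an RDD consumed by two downstream computations should be cached, since
otherwise its lineage is recomputed once per consumer.

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    def buildTwice(verts: RDD[(Long, Int)]): (Long, Long) = {
      // Used by both actions below, so cache it to avoid recomputation.
      val newVerts = verts.mapValues(_ + 1).cache()
      val changed = newVerts.filter { case (_, attr) => attr % 2 == 0 }.count()
      val total = newVerts.count() // without cache(), mapValues would run again here
      (changed, total)
    }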

> 2. I use the Pregel API and only use edgeTriplet.srcAttr in sendMsg. After
> that I get a new Graph, and I use graph.mapReduceTriplets with both
> edgeTriplet.srcAttr and edgeTriplet.dstAttr in sendMsg. I found that with
> the implementation of ReplicatedVertexView, Spark will recompute all of the
> graph, which should have been computed before. Can anyone explain the
> implementation here?


This is a known problem with the current implementation of join
elimination. If you do an operation that only needs one side of the triplet
(say, the source attribute) followed by an operation that uses both, the
second operation will re-replicate the source attributes unnecessarily. I
wrote a PR that solves this problem by only moving the necessary vertex
attributes
(<https://github.com/ankurdave/graphx/blob/74349e9c81fa626949172d1905dd7713ab6160ac/graphx/src/main/scala/org/apache/spark/graphx/impl/ReplicatedVertexView.scala#L52>):
https://github.com/amplab/graphx/pull/137.
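
As a rough sketch of the scenario in the question (the graph and message types
here are placeholders, and the Pregel step is simplified to a plain
mapReduceTriplets call): the first pass reads only srcAttr, so only source
attributes need to be shipped to the edge partitions; the second pass reads
both srcAttr and dstAttr. Without the fix, the second pass re-replicates the
source attributes even though they were already shipped for the first pass.

    import org.apache.spark.graphx._

    def twoPasses(graph: Graph[Int, Int]): (VertexRDD[Int], VertexRDD[Int]) = {
      // Pass 1: sendMsg touches only the source attribute.
      val srcOnly = graph.mapReduceTriplets[Int](
        triplet => Iterator((triplet.dstId, triplet.srcAttr)),
        (a, b) => a + b)

      // Pass 2: sendMsg touches both endpoints; ideally only the destination
      // attributes would need to be shipped at this point.
      val both = graph.mapReduceTriplets[Int](
        triplet => Iterator((triplet.srcId, triplet.srcAttr + triplet.dstAttr)),
        (a, b) => math.max(a, b))

      (srcOnly, both)
    }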

> 3. Why does VertexPartition not extend Serializable? It's used by an RDD.


Though we do create RDD[VertexPartition]s, VertexPartition itself should
never need to be serialized. Instead, we move vertex attributes using
VertexAttributeBlock.

Some other classes in GraphX (GraphImpl, for example) are marked
serializable for a different reason: to make it harmless if closures
accidentally capture them. These classes have all their fields marked
@transient so nothing really gets serialized.
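
A minimal sketch of that pattern (not the actual GraphX classes): the class
extends Serializable only so that a closure that accidentally captures it can
still be serialized, and every field is @transient so nothing of substance is
actually written out.

    // bigState is dropped during serialization; on an executor it would simply
    // be null, which is fine because executors never actually use it.
    class HarmlessWrapper(@transient val bigState: Array[Long]) extends Serializable {
      def total: Long = bigState.sum // only ever called on the driver
    }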

> 4. Can you provide a "spark.default.cache.useDisk" setting for using
> DISK_ONLY in cache() by default?


This isn't officially supported yet, but I have a patch
(<https://github.com/ankurdave/graphx/commit/aa78d782e606a6c0c0f5b8253ae4f408531fd015>)
that will let you do it in a hacky way by passing the desired storage level
everywhere.
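
In the meantime, a rough workaround (a sketch, not the patched API) is to
persist the RDDs you can reach directly at the desired level before running
any actions. Intermediate RDDs created inside GraphX operators will still use
the default storage level, which is what the patch addresses.

    import org.apache.spark.graphx._
    import org.apache.spark.storage.StorageLevel

    def persistOnDisk[VD, ED](graph: Graph[VD, ED]): Graph[VD, ED] = {
      // Note: persist fails if an RDD was already assigned a different level,
      // so this must run before anything else caches these RDDs.
      graph.vertices.persist(StorageLevel.DISK_ONLY)
      graph.edges.persist(StorageLevel.DISK_ONLY)
      graph
    }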

Ankur <http://www.ankurdave.com/>
