[ 
https://issues.apache.org/jira/browse/SPARK-13714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-13714:
---------------------------------
    Description: 
Current ConnectedComponents algorithm was based on Min-VertexId Propagation, 
which is sensitive to the place of Min-VertexId.
While this implementation is based on Max-Degree Propagation.
First, the degree graph is computed. And in the pregel progress, the vertex 
with the max degree in a CC is the start point of propagation.
This new method has advantages over the old one:
1, The convergence is only determined by the structs of CC, and is robust to 
the place of vertex with Min-ID.
2, For spherical CCs in which there may be a concept like 'center', it can 
accelerate the convergence. For example, GraphGenerators.gridGraph(sc, 3, 3), 
the old CC need 4 supersteps, while the new one only need 2 supersteps.
3, If we limit the number of iteration, the new method tend to generate more 
acceptable results.
4, The output for each CC is the vertex with max degree in it, which may be 
more meaningful. And because the vertex-ID is nominal in most cases, the vertex 
with min-ID in a CC is somewhat meanless.

But there are still two disadvantages:
1,The message body grows, from (VID) to (VID, Degree). that is (Long) -> (Long, 
Int)
2,For graph with simple CCs, it may be slower than old one. Because it need a 
extra degree computation.

The api is the same like ConnectedComponents:

val graph = ...
val cc = graph.ConnectedComponentsWithDegree(100)
or
val cc = ConnectedComponentsWithDegree.run(graph, 100)


  was:
Current ConnectedComponents algorithm was based on Min-VertexId Propagation, 
which is sensitive to the place of Min-VertexId.
While this implementation is based on Max-Degree Propagation.
First, the degree graph is computed. And in the pregel progress, the vertex 
with the max degree in a CC is the start point of propagation.
This new method has advantages over the old one:
1, The convergence is only determined by the structs of CC, and is robust to 
the place of vertex with Min-ID.
2, For spherical CCs in which there may be a concept like 'center', it can 
accelerate the convergence. For example, GraphGenerators.gridGraph(sc, 3, 3), 
the old CC need 4 supersteps, while the new one only need 2 supersteps.
3, If we limit the number of iteration, the new method tend to generate more 
acceptable results.
4, The output for each CC is the vertex with max degree in it, which may be 
more meaningful. And because the vertex-ID is nominal in most cases, the vertex 
with min-ID in a CC is somewhat meanless.

But there are still two disadvantages:
1,The message boy grows, from (VID) to (VID, Degree). that is (Long) -> (Long, 
Int)
2,For graph with simple CCs, it may be slower than old one. Because it need a 
extra degree computation.

The api is the same like ConnectedComponents:

val graph = ...
val cc = graph.ConnectedComponentsWithDegree(100)
or
val cc = ConnectedComponentsWithDegree.run(graph, 100)



> Another ConnectedComponents based on Max-Degree Propagation
> -----------------------------------------------------------
>
>                 Key: SPARK-13714
>                 URL: https://issues.apache.org/jira/browse/SPARK-13714
>             Project: Spark
>          Issue Type: New Feature
>          Components: GraphX
>            Reporter: zhengruifeng
>            Priority: Minor
>
> Current ConnectedComponents algorithm was based on Min-VertexId Propagation, 
> which is sensitive to the place of Min-VertexId.
> While this implementation is based on Max-Degree Propagation.
> First, the degree graph is computed. And in the pregel progress, the vertex 
> with the max degree in a CC is the start point of propagation.
> This new method has advantages over the old one:
> 1, The convergence is only determined by the structs of CC, and is robust to 
> the place of vertex with Min-ID.
> 2, For spherical CCs in which there may be a concept like 'center', it can 
> accelerate the convergence. For example, GraphGenerators.gridGraph(sc, 3, 3), 
> the old CC need 4 supersteps, while the new one only need 2 supersteps.
> 3, If we limit the number of iteration, the new method tend to generate more 
> acceptable results.
> 4, The output for each CC is the vertex with max degree in it, which may be 
> more meaningful. And because the vertex-ID is nominal in most cases, the 
> vertex with min-ID in a CC is somewhat meanless.
> But there are still two disadvantages:
> 1,The message body grows, from (VID) to (VID, Degree). that is (Long) -> 
> (Long, Int)
> 2,For graph with simple CCs, it may be slower than old one. Because it need a 
> extra degree computation.
> The api is the same like ConnectedComponents:
> val graph = ...
> val cc = graph.ConnectedComponentsWithDegree(100)
> or
> val cc = ConnectedComponentsWithDegree.run(graph, 100)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to