I've been trying to reproduce this, but I haven't succeeded so far. For
example, on the web-Google graph
<https://snap.stanford.edu/data/web-Google.html>, I get the expected
results both on v0.9.1-handle-empty-partitions and on master:

// Load web-Google and run connected components
import org.apache.spark.graphx._
val g = GraphLoader.edgeListFile(sc, "/Users/ankurdave/Downloads/web-Google.txt",
  minEdgePartitions=8)
g.vertices.count // => 875713
val cc = g.connectedComponents.vertices.map(_._2).cache()
cc.count // => 875713
val counts = cc.countByValue
counts.values.sum // => 875713
// There should not be any single-vertex components, because we loaded an edge list
counts.count(_._2 == 0) // => 0
counts.count(_._2 == 1) // => 0
counts.count(_._2 == 2) // => 783
counts.count(_._2 == 3) // => 503
// The 3 smallest and largest components in the graph (with nondeterministic tiebreaking)
counts.toArray.sortBy(_._2).take(3) // => Array((418467,2), (272504,2), (719750,2))
counts.toArray.sortBy(_._2).takeRight(3) // => Array((1363,384), (1734,404), (0,855802))
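
The same sanity check can be tried on a toy graph without Spark. The sketch below is only an illustration in plain Scala, assuming a hypothetical three-edge list with no self-loops; `edges`, `parent`, `find`, and `union` are names introduced here, and union-find is a stand-in for the check, not how GraphX computes components.

```scala
import scala.collection.mutable

// Hypothetical toy edge list (not from web-Google): components {1,2,3} and {4,5}.
val edges: Seq[(Long, Long)] = Seq((1L, 2L), (2L, 3L), (4L, 5L))

// Union-find: maps each vertex to a parent; a root is its own parent.
val parent = mutable.Map.empty[Long, Long]

def find(v: Long): Long = {
  val p = parent.getOrElseUpdate(v, v)
  if (p == v) v
  else {
    val root = find(p)
    parent(v) = root // path compression
    root
  }
}

def union(a: Long, b: Long): Unit = {
  val (ra, rb) = (find(a), find(b))
  // Keep the smaller id as the representative, mirroring how GraphX labels
  // each component with its lowest vertex id.
  if (ra != rb) parent(math.max(ra, rb)) = math.min(ra, rb)
}

edges.foreach { case (src, dst) => union(src, dst) }

// Component id -> number of vertices, analogous to cc.countByValue.
val counts: Map[Long, Int] =
  parent.keys.toList.groupBy(find).map { case (id, vs) => id -> vs.size }

// Every vertex came from an edge to a different vertex, so no component has size 1.
val singletons = counts.count(_._2 == 1)
```

Here `counts` plays the role of `cc.countByValue`, and `singletons` corresponds to `counts.count(_._2 == 1)`.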


Ankur <http://www.ankurdave.com/>
