I've been trying to reproduce this, but I haven't succeeded so far. For example, on the web-Google graph <https://snap.stanford.edu/data/web-Google.html>, I get the expected results both on v0.9.1-handle-empty-partitions and on master:

// Load web-Google and run connected components
import org.apache.spark.graphx._
val g = GraphLoader.edgeListFile(sc, "/Users/ankurdave/Downloads/web-Google.txt", minEdgePartitions=8)
g.vertices.count // => 875713
val cc = g.connectedComponents.vertices.map(_._2).cache()
cc.count // => 875713
val counts = cc.countByValue
counts.values.sum // => 875713

// There should not be any single-vertex components, because we loaded an edge list
counts.count(_._2 == 0) // => 0
counts.count(_._2 == 1) // => 0
counts.count(_._2 == 2) // => 783
counts.count(_._2 == 3) // => 503

// The 3 smallest and largest components in the graph (with nondeterministic tiebreaking)
counts.toArray.sortBy(_._2).take(3) // => Array((418467,2), (272504,2), (719750,2))
counts.toArray.sortBy(_._2).takeRight(3) // => Array((1363,384), (1734,404), (0,855802))

Ankur <http://www.ankurdave.com/>