Thanks a lot for the explanation, Matei.
As a matter of fact, I was just reading the part of the paper on narrow and
wide dependencies and saw that groupByKey is indeed a wide dependency, which,
as you explained, is the problem.
Maybe it wouldn't be a bad thing to have a section in the docs on the
wi
1 /
pair._2._2)}.collectAsMap()
Afterwards, the total change between the old and new centroids is calculated
in order to know when to stop iterating:
tempDist = 0.0
for (i <- 0 until K) {
  tempDist += kPoints(i).squaredDist(newPoints(i))
}
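
For anyone who wants to try the stopping check on its own, here is a minimal sketch in plain Scala with no Spark dependency. It is only an illustration of the loop above, not the code from my repository: the Point alias, the squaredDist and totalShift helpers, the sample centroids and the convergeDist threshold are all assumptions made up for this example.

```scala
object ConvergenceSketch {
  // Hypothetical stand-in for the vector type used in the snippet above.
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDist(a: Point, b: Point): Double = {
    require(a.length == b.length, "points must have the same dimension")
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  }

  // Sum of the squared distances each centroid moved between two iterations.
  def totalShift(oldCentroids: Seq[Point], newCentroids: Seq[Point]): Double = {
    require(oldCentroids.length == newCentroids.length, "need the same number of centroids")
    oldCentroids.zip(newCentroids).map { case (o, n) => squaredDist(o, n) }.sum
  }

  def main(args: Array[String]): Unit = {
    val convergeDist = 1e-3 // assumed threshold, pick whatever suits the data
    val kPoints = Seq(Array(0.0, 0.0), Array(4.0, 4.0))   // centroids before the update
    val newPoints = Seq(Array(0.0, 1.0), Array(4.0, 2.0)) // centroids after the update

    val tempDist = totalShift(kPoints, newPoints)
    println(s"tempDist = $tempDist, converged = ${tempDist < convergeDist}")
  }
}
```

The iteration would stop once tempDist drops below the threshold. Using the squared distance avoids a square root per centroid, at the cost that the threshold is then expressed in squared units.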
*my algorithm *
(https://github.com/ticup/k-means-spark/blob/master