Canopy clustering can be used to find the initial centroids. This may make the results more stable (both the number of iterations needed to converge and the clusters found). However, it is not guaranteed that Canopy clustering will find the same centroids on every run.
________________________________________
From: Joshi, Shrinivas [[email protected]]
Sent: Monday, June 18, 2012 6:33 PM
To: [email protected]
Subject: KMeans with ASFEmail archive data set
Hi,

I have been looking at KMeans clustering of the ASFEmail archive data set using the script that is part of the examples directory. This is with a Mahout 0.6, Hadoop 1.0.3 and JDK 7 u4 stack.

I have noticed that sometimes the algorithm converges in 1 iteration (randomSeed iteration + a clustering iteration) and sometimes it takes 5 iterations. This is probably due to how the initial centroids get picked. Is this expected behavior? Is there any way to make the initial centroid selection uniformly random?

Thanks,
-Shrinivas
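
For anyone reading this thread later, here is a rough sketch of the canopy idea mentioned in the reply, in plain Python. It is not Mahout's implementation: the function name canopy_centers, the thresholds t1 and t2, and the seed parameter are all made up for illustration. It picks a candidate point, averages everything within the loose threshold t1 to form a centroid, drops every candidate within the tight threshold t2, and repeats until no candidates remain. The order in which candidates are visited is shuffled, which is one reason the centroids can differ between runs unless the seed is fixed.

```python
import math
import random


def canopy_centers(points, t1, t2, seed=None):
    """Simplified canopy pass returning one centroid per canopy.

    points: list of equal-length numeric tuples.
    t1: loose threshold, points within t1 of a picked point form its canopy.
    t2: tight threshold, points within t2 of a picked point stop being candidates.
    seed: hypothetical knob for this sketch; fixes the visiting order.
    """
    if not t1 > t2 > 0:
        raise ValueError("expected t1 > t2 > 0")

    rng = random.Random(seed)
    candidates = list(points)
    rng.shuffle(candidates)  # visiting order is the source of run-to-run variation

    centers = []
    while candidates:
        picked = candidates.pop()
        # Canopy members: every input point within the loose threshold.
        members = [p for p in points if math.dist(p, picked) <= t1]
        dim = len(picked)
        centers.append(
            tuple(sum(p[i] for p in members) / len(members) for i in range(dim))
        )
        # Points close to the picked one can no longer start a canopy.
        candidates = [p for p in candidates if math.dist(p, picked) > t2]
    return centers
```

On well-separated data the centroids tend to come out the same whatever the visiting order. On overlapping data they can move depending on which point is picked first, so the resulting centroids (and the number of KMeans iterations that follow) may still vary from run to run.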
