It is possible for kmeans to end up with a cluster that has no points assigned to it, and I believe this explains why you see only 4 clusters when you requested 5 (-k 5). Also, looking at your clustered data, the points are printed in sparse vector notation (index:value), which lists only the nonzero elements. The vectors still have all of their dimensions; your input vectors just appear to have only 2 or 3 nonzero elements each. This sparsity could be the reason one of the clusters was dropped.
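To make both points concrete, here is a minimal Python sketch. It is not Mahout code: the function names are made up for illustration, the three centroids are hypothetical, and I am assuming the printed indices are 0-based. It expands the printed index:value form back into a full 288-element vector, then runs a single nearest-centroid assignment step in which one centroid attracts no points.

```python
import re

def parse_sparse(text):
    """Parse '[186:22.000, 189:32.000]' into {186: 22.0, 189: 32.0}."""
    pairs = re.findall(r"(\d+):(-?\d+(?:\.\d+)?)", text)
    return {int(index): float(value) for index, value in pairs}

def to_dense(sparse, dim):
    """Expand a sparse {index: value} map into a dense list of length dim.

    Assumes 0-based indices; elements not listed are zero.
    """
    dense = [0.0] * dim
    for index, value in sparse.items():
        dense[index] = value
    return dense

def squared_distance(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)

def assign(points, centroids):
    """Count how many points fall nearest to each centroid."""
    counts = [0] * len(centroids)
    for point in points:
        nearest = min(range(len(centroids)),
                      key=lambda j: squared_distance(point, centroids[j]))
        counts[nearest] += 1
    return counts

# Three points copied from the clusterdump output in the message below.
points = [
    parse_sparse("[186:22.000, 189:32.000]"),
    parse_sparse("[186:26.000]"),
    parse_sparse("[186:11.000, 189:1.000]"),
]

# Hypothetical centroids; the third lies far from every point.
centroids = [
    {186: 12.0, 189: 9.0},
    {186: 25.0, 189: 30.0},
    {10: 50.0},
]

dense = to_dense(points[0], 288)
counts = assign(points, centroids)
print(len(dense), counts)
```

The first point still has 288 elements once expanded; only two of them are nonzero, which is all the dump prints. The third centroid ends the step with a count of zero, which is the situation I suspect left you with 4 clusters instead of 5.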
-----Original Message-----
From: Abhik Banerjee [mailto:[email protected]]
Sent: Friday, July 29, 2011 3:48 PM
To: [email protected]
Subject: Analyzing the clusterdump output - kmeans clustering

Hi,

I managed to run the kmeans algorithm on a Cloudera VM, using the help provided on the wiki and on the forum. I got my output and am trying to use clusterdump to analyze my result. (I seem to have given 5 iterations, but it seems to have formed only 4 clusters; I am also curious about that.) I ran the command below:

    mahout kmeans -i hdfs://localhost/mahout_input/ip -o hdfs://localhost/mahout_output/output_kmeans_07_29_1/ -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 1.0 -c hdfs://localhost/mahout_input/centroids_07_29_1 -k 5 -x 5 -cl

After kmeans completed on the Hadoop Cloudera VM, I ran this command:

    mahout clusterdump --seqFileDir hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusters-5/part-r-00000 --pointsDir hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusteredPoints --output kmeans_07_29_1_cl5.tx

When I look into the text file I see a structure like this:

    CL-99871{n=10157 c=[186:12.229, 189:9.343, 212:2.716] r=[186:7.803, 189:8.054, 212:4.686]}
    Weight: Point:
    1.0: 1.161.199.19 = [186:22.000, 189:32.000]
    1.0: 1.161.204.226 = [186:9.000, 189:11.000]
    1.0: 1.170.149.79 = [186:18.000, 189:10.000]
    1.0: 1.175.137.84 = [186:23.000, 189:8.000]
    1.0: 1.176.27.109 = [186:7.000, 189:9.000, 212:3.000]
    1.0: 1.177.175.26 = [186:12.000, 189:12.000]
    1.0: 1.197.208.25 = [186:26.000]
    1.0: 1.212.176.27 = [186:11.000, 189:1.000]
    1.0: 1.212.176.28 = [186:11.000, 189:6.000]
    1.0: 1.22.160.35 = [186:17.000, 189:6.000]
    1.0: 1.230.123.81 = [186:18.000, 189:4.000]

I can figure out the first part of it, as explained in the wiki: the name is CL-99871, the number of points is 10157, the cluster center is [ ] in vector form, and the radius is [ ]. I don't understand how the later part is structured. The IP addresses are the names of my data points, which I wanted to get clustered. What do those vector values mean? If they are the vectors of those points, I am not sure why they are only 2-dimensional, as my original data points consisted of 288 dimensions for each IP address.

Thanks for all the help,
Abhik
