Kmeans will not create new clusters beyond the ones it is given as its prior (the initial centroids passed with -c). When you ran canopy, I suspect you actually got 12 clusters there too (run clusterdump on clusters-0 to verify). The CL clusters have not yet converged to VL clusters, so you need either a larger -cd value or more iterations (-x) for them to converge. The fact that most of your points ended up in CL-0 indicates that they were mostly alike. That's what clustering does :).
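A quick way to see how many clusters are in a dump and which of them have converged is to tally the CL-/VL- prefixes. The following is only a minimal sketch: the regular expression is inferred from the clusterdump lines quoted in this thread rather than from any format specification, and `summarize` and `sample` are hypothetical names used for illustration.

```python
import re

# Matches the start of one cluster entry, e.g. "CL-0{n=245243 c=[...]".
# Assumption: the layout is the one shown in the output quoted in this thread.
CLUSTER_RE = re.compile(r"\b(CL|VL)-(\d+)\{n=(\d+) c=\[")

def summarize(dump_text):
    """Return sorted (cluster_id, converged, n) tuples for every cluster found."""
    rows = []
    for prefix, cluster_id, n in CLUSTER_RE.findall(dump_text):
        # VL marks a converged cluster, CL one that has not converged yet.
        rows.append((int(cluster_id), prefix == "VL", int(n)))
    return sorted(rows)

# Three entries copied from the output quoted in this thread.
sample = (
    "CL-0{n=245243 c=[186:1.979, 189:1.773, 212:2.412] "
    "r=[186:3.736, 189:3.471, 212:4.223]} "
    "CL-7{n=8 c=[186:422.125, 189:350.750, 212:316.875] "
    "r=[186:48.486, 189:50.630, 212:59.742]} "
    "VL-3{n=10 c=[186:378.800, 189:355.600, 212:12.600] "
    "r=[186:88.654, 189:61.557, 212:19.454]}"
)

for cluster_id, converged, n in summarize(sample):
    state = "converged" if converged else "not converged"
    print(f"cluster {cluster_id}: n={n}, {state}")
```

Running the same function over the clusterdump of clusters-0 from the canopy run would show whether 11 or 12 clusters went into kmeans.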
I'm a little suspicious of your input data. You claim it has 288 dimensions, but all of your clusters show centroid values only for elements 186, 189 and 212. This suggests to me that all of the 285 other element values are zero. I'd recheck the input vectors to make sure you are really getting the sort of input you desire. Other than this, it looks like you are on the right track.

-----Original Message-----
From: Abhik Banerjee [mailto:[email protected]]
Sent: Tuesday, August 02, 2011 10:01 AM
To: [email protected]
Subject: Re: Analyzing the clusterdump output - kmeans clustering

Hi,

I am still having a few issues interpreting the result from the kmeans clusterdump output, and I would be thankful if someone could help me out with it. I first used canopy clustering, which gave me 11 centroid points, and then I used that output with the mahout kmeans command to generate the output folder (I ran it for 8 iterations):

mahout kmeans -i /var/tmp/input_kmeans_08_01_01/ipfull -o /var/tmp/output_kmeans_08_01_03_50_200/ -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.01 -c hdfs:/var/tmp/output_canopy_08_01_03_50_200/clusters-0/ -cl -x 8

This gave me folders clusters-0 to clusters-8 and a folder named clusteredPoints, after which I used the clusterdump command and got the following result.

mahout clusterdump --seqFileDir hdfs:/var/tmp/output_kmeans_08_01_04_75_250/clusters-8/part-r-00000 --pointsDir hdfs:/var/tmp/output_kmeans_08_01_04_75_250/clusteredPoints/ -o kmeans_output_08_01/output_kmeans_08_01_04_75_250.txt

(My question is: I started with 11 centroid points, but here I see 12 CLs and VLs including the CL-0 cluster. I do not understand why this results in 12 clusters when I only started with 11. I also read somewhere on the forum that the difference between CL-0 and VL-0 has to do with whether the cluster converged or not. I would be thankful if someone could help me interpret the result. I tried the same while varying the cluster centers and the number of cluster centers; each time I get an extra cluster center including cluster-0, and a large chunk of my points always lies in cluster-0 compared to the other clusters.)

CL-0{n=245243 c=[186:1.979, 189:1.773, 212:2.412] r=[186:3.736, 189:3.471, 212:4.223]}
CL-1{n=7719 c=[186:23.854, 189:22.358, 212:19.325] r=[186:14.610, 189:15.440, 212:18.432]}
CL-10{n=15 c=[186:302.333, 189:122.733, 212:26.000] r=[186:61.250, 189:64.750, 212:43.909]}
CL-11{n=44 c=[186:176.568, 189:112.955, 212:191.932] r=[186:41.269, 189:37.998, 212:49.329]}
CL-2{n=43 c=[186:26.047, 189:196.395, 212:72.767] r=[186:29.565, 189:56.176, 212:61.852]}
CL-4{n=179 c=[186:123.464, 189:32.682, 212:9.380] r=[186:41.883, 189:24.915, 212:20.090]}
CL-5{n=273 c=[186:102.070, 189:102.099, 212:52.630] r=[186:37.351, 189:35.809, 212:40.499]}
CL-6{n=84 c=[186:193.810, 189:207.940, 212:171.179] r=[186:35.053, 189:39.931, 212:32.354]}
CL-8{n=113 c=[186:6.894, 189:8.841, 212:204.876] r=[186:20.501, 189:21.619, 212:70.855]}
CL-9{n=25 c=[186:286.000, 189:238.440, 212:280.120] r=[186:54.655, 189:57.882, 212:48.926]}
CL-7{n=8 c=[186:422.125, 189:350.750, 212:316.875] r=[186:48.486, 189:50.630, 212:59.742]}

[abanerjee@m0002006 ~]$ cat kmeans_output_08_01/output_kmeans_08_01_05_50_175.txt | grep VL-
VL-3{n=10 c=[186:378.800, 189:355.600, 212:12.600] r=[186:88.654, 189:61.557, 212:19.454]}

Thanks all for your help.
Abhik

Jeff Eastman <jeastman <at> Narus.com> writes:

>
> It is possible for kmeans to fail to assign any points to one of its clusters, and I believe this is the
> explanation for your seeing only 4 clusters when you requested 5. Also, looking at your clustered data,
> the points are printed out in sparse vector notation (index:value) and your input vectors only appear to
> have 2 or 3 nonzero elements. This could be the reason for dropping one of the clusters.
>
> -----Original Message-----
> From: Abhik Banerjee [mailto:banerjee.abhik.hcl <at> gmail.com]
> Sent: Friday, July 29, 2011 3:48 PM
> To: user <at> mahout.apache.org
> Subject: Analyzing the clusterdump output - kmeans clustering
>
> Hi,
>
> I managed to run the kmeans algorithm on a cloudera vm , using the
> help provided at the wiki and help at the forum . I got my output and
> am trying to use the clusterdump to analyze my result.
>
> (I seemed to give 5 iterations , but it seems to have formed only 4
> clusters , I am also curious about that , I ran this below command )
>
> mahout kmeans -i hdfs://localhost/mahout_input/ip -o
> hdfs://localhost/mahout_output/output_kmeans_07_29_1/ -dm
> org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 1.0 -c
> hdfs://localhost/mahout_input/centroids_07_29_1 -k 5 -x 5 -cl
>
> after k means completion on hadoop cloudera vm I ran this command :-
>
> mahout clusterdump --seqFileDir
> hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusters-5/part-r-00000
> --pointsDir hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusteredPoints
> --output kmeans_07_29_1_cl5.tx
>
> and when I look into the text file I see a structure like this
>
> CL-99871{n=10157 c=[186:12.229, 189:9.343, 212:2.716] r=[186:7.803, 189:8.054, 212:4.686]}
> Weight: Point:
> 1.0: 1.161.199.19 = [186:22.000, 189:32.000]
> 1.0: 1.161.204.226 = [186:9.000, 189:11.000]
> 1.0: 1.170.149.79 = [186:18.000, 189:10.000]
> 1.0: 1.175.137.84 = [186:23.000, 189:8.000]
> 1.0: 1.176.27.109 = [186:7.000, 189:9.000, 212:3.000]
> 1.0: 1.177.175.26 = [186:12.000, 189:12.000]
> 1.0: 1.197.208.25 = [186:26.000]
> 1.0: 1.212.176.27 = [186:11.000, 189:1.000]
> 1.0: 1.212.176.28 = [186:11.000, 189:6.000]
> 1.0: 1.22.160.35 = [186:17.000, 189:6.000]
> 1.0: 1.230.123.81 = [186:18.000, 189:4.000]
>
> I can figure the first part of it , as explained in the wiki , that
> the name is CL-99871 , number of points is 10157 , cluster center is [
> ] in the vector form , radius is [ ] ,
>
> I dont understand how the later part of it is structured , the Ip
> addresses are my name - data points which I wanted to get clustered,
> what do those vector values mean , if they mean the vectors of those
> points , I am not sure why they are only 2 dimensional as my original
> data points were consisting of 288 dimensions , for each ip address.
>
> Thanks for all the help,
> Abhik
>
>
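
On the question of why the listed points look 2-dimensional: as noted earlier in the thread, they are printed in sparse vector notation (index:value), so only nonzero elements appear and every other element is implicitly zero. The sketch below expands one of the quoted points back to 288 elements. It assumes 0-based indices and a cardinality of 288 (the dimension stated in the question); `parse_sparse` and `to_dense` are hypothetical helper names and not part of Mahout.

```python
def parse_sparse(text):
    """Parse '[186:22.000, 189:32.000]' into {186: 22.0, 189: 32.0}."""
    body = text.strip().strip("[]")
    entries = {}
    for term in body.split(","):
        if term.strip():
            index, value = term.split(":")
            entries[int(index)] = float(value)
    return entries

def to_dense(entries, cardinality):
    """Expand a sparse {index: value} mapping into a dense list of floats."""
    dense = [0.0] * cardinality  # every index not listed stays at zero
    for index, value in entries.items():
        dense[index] = value  # assumption: indices are 0-based
    return dense

# Point 1.161.199.19 from the clusterdump output quoted above.
point = parse_sparse("[186:22.000, 189:32.000]")
dense = to_dense(point, 288)
nonzero = [i for i, v in enumerate(dense) if v != 0.0]
print(len(dense), nonzero)
```

If the same two or three indices turn out to be the only nonzero ones across all input vectors, that points to a problem in how the vectors were generated rather than in the clustering itself.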
