K-means will not create any clusters beyond the ones it is given as the prior. 
I suspect your canopy run actually produced 12 clusters as well (run 
clusterdump on clusters-0 to verify). The CL clusters have not yet converged to 
VL clusters, so you need either a larger -cd value or more iterations (-x) for 
them to converge. The fact that most of your points ended up in CL-0 indicates 
that they are mostly alike. That's what clustering does :).
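
If it helps, here is a minimal Python sketch for tallying a clusterdump text 
file: how many clusters it contains, which ones carry the converged VL prefix, 
and what share of the points each cluster holds. The sample lines are copied 
from the output in this thread. The function name summarize and the regular 
expression are my own, and the sketch assumes the header format looks like the 
lines shown here, so treat it as an illustration rather than a Mahout tool.

```python
import re

# Sample lines copied from the clusterdump output quoted in this thread.
DUMP = """\
CL-0{n=245243 c=[186:1.979, 189:1.773, 212:2.412] r=[186:3.736, 189:3.471, 212:4.223]}
CL-1{n=7719 c=[186:23.854, 189:22.358, 212:19.325] r=[186:14.610, 189:15.440, 212:18.432]}
CL-7{n=8 c=[186:422.125, 189:350.750, 212:316.875] r=[186:48.486, 189:50.630, 212:59.742]}
VL-3{n=10 c=[186:378.800, 189:355.600, 212:12.600] r=[186:88.654, 189:61.557, 212:19.454]}
"""

# Matches a cluster header: prefix (CL or VL), cluster id, and point count n.
# Assumption: headers look like "CL-0{n=245243 ..." as in the output above.
HEADER = re.compile(r"\b(CL|VL)-(\d+)\{n=(\d+)")


def summarize(dump_text):
    """Return (clusters, total_points); clusters maps e.g. 'CL-0' to its n."""
    clusters = {}
    for prefix, cluster_id, n in HEADER.findall(dump_text):
        clusters[f"{prefix}-{cluster_id}"] = int(n)
    return clusters, sum(clusters.values())


clusters, total = summarize(DUMP)
converged = sorted(name for name in clusters if name.startswith("VL-"))
print("clusters:", len(clusters), "converged:", converged)

# List clusters from largest to smallest with their share of all points.
for name, n in sorted(clusters.items(), key=lambda kv: -kv[1]):
    print(f"{name}: n={n} ({n / total:.1%} of points)")
```

Running the same tally on the clusters-0 dump from the canopy run would show 
directly whether kmeans started from 11 or 12 clusters.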

I'm a little suspicious of your input data. You say it has 288 dimensions, 
but all of your cluster centroids show values only at elements 186, 189 and 
212. This suggests to me that the other 285 elements are all zero. I'd recheck 
the input vectors to make sure you are really getting the sort of input you 
want.
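
One way to run that check on the dumped points is sketched below in Python. It 
parses the sparse index:value notation and collects every index that is nonzero 
in at least one point. The sample points are copied from the earlier 
clusterdump output in this thread; the names parse_sparse, used_indices and 
EXPECTED_DIMS are made up for this illustration, and the parsing assumes the 
"weight: name = [index:value, ...]" layout shown there.

```python
import re

# Points copied from the earlier clusterdump output quoted in this thread.
POINTS = """\
1.0: 1.161.199.19 = [186:22.000, 189:32.000]
1.0: 1.176.27.109 = [186:7.000, 189:9.000, 212:3.000]
1.0: 1.197.208.25 = [186:26.000]
"""

EXPECTED_DIMS = 288  # dimensionality the poster says the input has


def parse_sparse(vector_text):
    """Parse '[186:22.000, 189:32.000]' into {186: 22.0, 189: 32.0}."""
    pairs = re.findall(r"(\d+):(-?\d+(?:\.\d+)?)", vector_text)
    return {int(index): float(value) for index, value in pairs}


def used_indices(points_text):
    """Collect every index that is nonzero in at least one point."""
    seen = set()
    for line in points_text.splitlines():
        if "=" not in line:
            continue  # skip lines that are not "weight: name = [vector]"
        _, vector_text = line.split("=", 1)
        vector = parse_sparse(vector_text)
        seen.update(index for index, value in vector.items() if value != 0.0)
    return seen


indices = used_indices(POINTS)
print("nonzero indices:", sorted(indices))
print("always-zero dimensions:", EXPECTED_DIMS - len(indices))
```

If the full clusteredPoints dump still reports only a handful of indices, the 
problem is in how the input vectors were built, not in the clustering.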

Other than this, it looks like you are on the right track.

-----Original Message-----
From: Abhik Banerjee [mailto:[email protected]] 
Sent: Tuesday, August 02, 2011 10:01 AM
To: [email protected]
Subject: Re: Analyzing the clusterdump output - kmeans clustering

Hi,

I am still having a few issues interpreting the kmeans clusterdump output,
and I would be thankful if someone could help me out with it. I first ran
canopy clustering, which gave me 11 centroid points, and then used that
output as input to the Mahout kmeans command below to generate the output
folder (I ran it for 8 iterations):

--------mahout kmeans -i /var/tmp/input_kmeans_08_01_01/ipfull -o
/var/tmp/output_kmeans_08_01_03_50_200/ -dm
org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.01 -c
hdfs:/var/tmp/output_canopy_08_01_03_50_200/clusters-0/ -cl -x 8

which gave me folders clusters-0 to clusters-8 and a folder named
clusteredPoints, after which I ran the clusterdump command and got
the following result.

-------- mahout clusterdump --seqFileDir
hdfs:/var/tmp/output_kmeans_08_01_04_75_250/clusters-8/part-r-00000
--pointsDir hdfs:/var/tmp/output_kmeans_08_01_04_75_250/clusteredPoints/ -o
kmeans_output_08_01/output_kmeans_08_01_04_75_250.txt

*(My question: I started with 11 centroid points, but here I see 12 CLs and
VLs including the CL-0 cluster. I don't understand why this results in 12
clusters when I started with only 11. I also read somewhere on the forum
that the difference between CL-0 and VL-0 has to do with whether the points
converged or not. I would be thankful if someone could help me interpret
the result. I tried the same thing while varying the cluster centers and the
number of cluster centers; each time I get one extra cluster center
including cluster-0, and a large chunk of my points always lies in
cluster-0 compared to the other clusters.)*

CL-0{n=245243 c=[186:1.979, 189:1.773, 212:2.412] r=[186:3.736, 189:3.471,
212:4.223]}
CL-1{n=7719 c=[186:23.854, 189:22.358, 212:19.325] r=[186:14.610,
189:15.440, 212:18.432]}
CL-10{n=15 c=[186:302.333, 189:122.733, 212:26.000] r=[186:61.250,
189:64.750, 212:43.909]}
CL-11{n=44 c=[186:176.568, 189:112.955, 212:191.932] r=[186:41.269,
189:37.998, 212:49.329]}
CL-2{n=43 c=[186:26.047, 189:196.395, 212:72.767] r=[186:29.565, 189:56.176,
212:61.852]}
CL-4{n=179 c=[186:123.464, 189:32.682, 212:9.380] r=[186:41.883, 189:24.915,
212:20.090]}
CL-5{n=273 c=[186:102.070, 189:102.099, 212:52.630] r=[186:37.351,
189:35.809, 212:40.499]}
CL-6{n=84 c=[186:193.810, 189:207.940, 212:171.179] r=[186:35.053,
189:39.931, 212:32.354]}
CL-8{n=113 c=[186:6.894, 189:8.841, 212:204.876] r=[186:20.501, 189:21.619,
212:70.855]}
CL-9{n=25 c=[186:286.000, 189:238.440, 212:280.120] r=[186:54.655,
189:57.882, 212:48.926]}
CL-7{n=8 c=[186:422.125, 189:350.750, 212:316.875] r=[186:48.486,
189:50.630, 212:59.742]}
[abanerjee@m0002006 ~]$ cat
kmeans_output_08_01/output_kmeans_08_01_05_50_175.txt | grep VL-
VL-3{n=10 c=[186:378.800, 189:355.600, 212:12.600] r=[186:88.654,
189:61.557, 212:19.454]}

Thanks all for your help .

Abhik

Jeff Eastman <jeastman <at> Narus.com> writes:

>
> It is possible for kmeans to fail to assign any points to one of its
> clusters, and I believe this is the explanation for your seeing only 4
> clusters when you requested 5. Also, looking at your clustered data,
> the points are printed out in sparse vector notation (index:value) and
> your input vectors only appear to have 2 or 3 nonzero elements. This
> could be the reason for dropping one of the clusters.
>
> -----Original Message-----
> From: Abhik Banerjee [mailto:banerjee.abhik.hcl <at> gmail.com]
> Sent: Friday, July 29, 2011 3:48 PM
> To: user <at> mahout.apache.org
> Subject: Analyzing the clusterdump output - kmeans clustering
>
> Hi,
>
> I managed to run the kmeans algorithm on a Cloudera VM, using the
> help provided on the wiki and on the forum. I got my output and
> am trying to use clusterdump to analyze my result.
>
> (I gave it 5 iterations, but it seems to have formed only 4
> clusters. I am also curious about that. I ran the command below.)
>
> mahout kmeans -i hdfs://localhost/mahout_input/ip -o
> hdfs://localhost/mahout_output/output_kmeans_07_29_1/ -dm
> org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 1.0 -c
> hdfs://localhost/mahout_input/centroids_07_29_1 -k 5 -x 5 -cl
>
> After kmeans completed on the Hadoop Cloudera VM, I ran this command:
>
> mahout clusterdump --seqFileDir
> hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusters-5/part-r-00000
> --pointsDir
> hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusteredPoints
> --output kmeans_07_29_1_cl5.tx
>
> and when I look into the text file I see a structure like this
>
> CL-99871{n=10157 c=[186:12.229, 189:9.343, 212:2.716] r=[186:7.803,
> 189:8.054, 212:4.686]}
> Weight:  Point:
> 1.0: 1.161.199.19 = [186:22.000, 189:32.000]
> 1.0: 1.161.204.226 = [186:9.000, 189:11.000]
> 1.0: 1.170.149.79 = [186:18.000, 189:10.000]
> 1.0: 1.175.137.84 = [186:23.000, 189:8.000]
> 1.0: 1.176.27.109 = [186:7.000, 189:9.000, 212:3.000]
> 1.0: 1.177.175.26 = [186:12.000, 189:12.000]
> 1.0: 1.197.208.25 = [186:26.000]
> 1.0: 1.212.176.27 = [186:11.000, 189:1.000]
> 1.0: 1.212.176.28 = [186:11.000, 189:6.000]
> 1.0: 1.22.160.35 = [186:17.000, 189:6.000]
> 1.0: 1.230.123.81 = [186:18.000, 189:4.000]
>
> I can figure out the first part of it, as explained in the wiki: the
> name is CL-99871, the number of points is 10157, the cluster center is
> [ ] in vector form, and the radius is [ ].
>
> I don't understand how the latter part is structured. The IP
> addresses are the names of the data points I wanted to cluster.
> What do those vector values mean? If they are the vectors of those
> points, I am not sure why they are only 2-dimensional, as my original
> data points consisted of 288 dimensions for each IP address.
>
> Thanks for all the help,
> Abhik
>
>
