Hi,

I am still having trouble interpreting the kmeans cluster dump output, and I
would be thankful if someone could help me out with it.
I first ran canopy clustering, which gave me 11 centroid points, and then
passed those centroids to the Mahout kmeans job to generate the output folder
using the command below (I ran it for 8 iterations):

--------mahout kmeans -i /var/tmp/input_kmeans_08_01_01/ipfull -o
/var/tmp/output_kmeans_08_01_03_50_200/ -dm
org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.01 -c
hdfs:/var/tmp/output_canopy_08_01_03_50_200/clusters-0/ -cl -x 8

This gave me folders named clusters-0 to clusters-8 and a folder named
clusteredPoints, after which I ran the clusterdump command and got the
following result.

-------- mahout clusterdump --seqFileDir
hdfs:/var/tmp/output_kmeans_08_01_04_75_250/clusters-8/part-r-00000
--pointsDir hdfs:/var/tmp/output_kmeans_08_01_04_75_250/clusteredPoints/ -o
kmeans_output_08_01/output_kmeans_08_01_04_75_250.txt

*(My question: I started with 11 centroid points, but here I see 12 CLs and
VLs in total, including the CL-0 cluster. I do not understand why this
results in 12 clusters when I only started with 11. I also read somewhere on
the forum that the difference between CL-0 and VL-0 has to do with whether
the points converged or not. I would be thankful if someone could help me
interpret the result. I tried the same thing while varying the cluster
centers and the number of cluster centers; each time I get one extra cluster
center, including cluster 0, and a large chunk of my points always lies in
cluster 0 compared to the other clusters.)*

CL-0{n=245243 c=[186:1.979, 189:1.773, 212:2.412] r=[186:3.736, 189:3.471,
212:4.223]}
CL-1{n=7719 c=[186:23.854, 189:22.358, 212:19.325] r=[186:14.610,
189:15.440, 212:18.432]}
CL-10{n=15 c=[186:302.333, 189:122.733, 212:26.000] r=[186:61.250,
189:64.750, 212:43.909]}
CL-11{n=44 c=[186:176.568, 189:112.955, 212:191.932] r=[186:41.269,
189:37.998, 212:49.329]}
CL-2{n=43 c=[186:26.047, 189:196.395, 212:72.767] r=[186:29.565, 189:56.176,
212:61.852]}
CL-4{n=179 c=[186:123.464, 189:32.682, 212:9.380] r=[186:41.883, 189:24.915,
212:20.090]}
CL-5{n=273 c=[186:102.070, 189:102.099, 212:52.630] r=[186:37.351,
189:35.809, 212:40.499]}
CL-6{n=84 c=[186:193.810, 189:207.940, 212:171.179] r=[186:35.053,
189:39.931, 212:32.354]}
CL-8{n=113 c=[186:6.894, 189:8.841, 212:204.876] r=[186:20.501, 189:21.619,
212:70.855]}
CL-9{n=25 c=[186:286.000, 189:238.440, 212:280.120] r=[186:54.655,
189:57.882, 212:48.926]}
CL-7{n=8 c=[186:422.125, 189:350.750, 212:316.875] r=[186:48.486,
189:50.630, 212:59.742]}
[abanerjee@m0002006 ~]$ cat
kmeans_output_08_01/output_kmeans_08_01_05_50_175.txt | grep VL-
VL-3{n=10 c=[186:378.800, 189:355.600, 212:12.600] r=[186:88.654,
189:61.557, 212:19.454]}
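
To make the counting easier to check, here is a minimal Python sketch that parses the summary lines of a clusterdump text file and tallies the CL- and VL- entries. It is only an illustration based on the line format shown above; the names `SAMPLE`, `parse_sparse` and `parse_clusters` are my own and not part of Mahout, and the sample holds just two of the lines from the dump.

```python
import re

# Two summary lines copied from the dump above (wrapped as in the mail).
SAMPLE = """\
CL-0{n=245243 c=[186:1.979, 189:1.773, 212:2.412] r=[186:3.736, 189:3.471,
212:4.223]}
VL-3{n=10 c=[186:378.800, 189:355.600, 212:12.600] r=[186:88.654,
189:61.557, 212:19.454]}
"""

# Assumed layout of a summary line: PREFIX-ID{n=COUNT c=[...] r=[...]}
CLUSTER_RE = re.compile(
    r"(?P<kind>CL|VL)-(?P<id>\d+)\{n=(?P<n>\d+)\s+"
    r"c=\[(?P<c>[^\]]*)\]\s+r=\[(?P<r>[^\]]*)\]\}"
)


def parse_sparse(text):
    """Turn 'index:value, index:value' into a {index: value} dict."""
    vec = {}
    for item in text.split(","):
        item = item.strip()
        if item:
            index, value = item.split(":")
            vec[int(index)] = float(value)
    return vec


def parse_clusters(dump_text):
    """Return one dict per cluster summary found in the dump text."""
    # Collapse whitespace so summaries wrapped over two lines are re-joined.
    flat = " ".join(dump_text.split())
    clusters = []
    for m in CLUSTER_RE.finditer(flat):
        clusters.append({
            "kind": m.group("kind"),
            "id": int(m.group("id")),
            "n": int(m.group("n")),
            "center": parse_sparse(m.group("c")),
            "radius": parse_sparse(m.group("r")),
        })
    return clusters


clusters = parse_clusters(SAMPLE)
for c in sorted(clusters, key=lambda c: c["id"]):
    print(c["kind"], c["id"], c["n"], c["center"])
print("CL:", sum(c["kind"] == "CL" for c in clusters),
      "VL:", sum(c["kind"] == "VL" for c in clusters))
```

Reading the whole output file into `parse_clusters` instead of `SAMPLE` should list every cluster id once, which makes it easier to see which ids carry the CL prefix and which carry VL.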

Thanks all for your help.

Abhik

Jeff Eastman <jeastman <at> Narus.com> writes:

>
> It is possible for kmeans to fail to assign any points to one of its
> clusters, and I believe this is the explanation for your seeing only 4
> clusters when you requested 5. Also, looking at your clustered data, the
> points are printed out in sparse vector notation (index:value) and your
> input vectors only appear to have 2 or 3 nonzero elements. This could be
> the reason for dropping one of the clusters.
>
> -----Original Message-----
> From: Abhik Banerjee [mailto:banerjee.abhik.hcl <at> gmail.com]
> Sent: Friday, July 29, 2011 3:48 PM
> To: user <at> mahout.apache.org
> Subject: Analyzing the clusterdump output - kmeans clustering
>
> Hi,
>
> I managed to run the kmeans algorithm on a cloudera vm , using the
> help provided at the wiki and help at the forum . I got my output and
> am trying to use the clusterdump to analyze my result.
>
>  (I seemed to give 5 iterations , but it seems to have formed only 4
> clusters , I am also curious about that , I ran this below command )
>
> mahout kmeans -i hdfs://localhost/mahout_input/ip -o
> hdfs://localhost/mahout_output/output_kmeans_07_29_1/ -dm
> org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 1.0 -c
> hdfs://localhost/mahout_input/centroids_07_29_1 -k 5 -x 5 -cl
>
> after k means completion on hadoop cloudera vm I ran this command :-
>
> mahout clusterdump --seqFileDir
> hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusters-5/part-r-00000
> --pointsDir
> hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusteredPoints
> --output kmeans_07_29_1_cl5.tx
>
> and when I look into the text file I see a structure like this
>
> CL-99871{n=10157 c=[186:12.229, 189:9.343, 212:2.716] r=[186:7.803,
> 189:8.054, 212:4.686]}
> Weight:  Point:
> 1.0: 1.161.199.19 = [186:22.000, 189:32.000]
> 1.0: 1.161.204.226 = [186:9.000, 189:11.000]
> 1.0: 1.170.149.79 = [186:18.000, 189:10.000]
> 1.0: 1.175.137.84 = [186:23.000, 189:8.000]
> 1.0: 1.176.27.109 = [186:7.000, 189:9.000, 212:3.000]
> 1.0: 1.177.175.26 = [186:12.000, 189:12.000]
> 1.0: 1.197.208.25 = [186:26.000]
> 1.0: 1.212.176.27 = [186:11.000, 189:1.000]
> 1.0: 1.212.176.28 = [186:11.000, 189:6.000]
> 1.0: 1.22.160.35 = [186:17.000, 189:6.000]
> 1.0: 1.230.123.81 = [186:18.000, 189:4.000]
>
> I can figure the first part of it , as explained in the wiki , that
> the name is CL-99871 , number of points is 10157 , cluster center is [
> ] in the vector form , radius is [ ] ,
>
> I dont understand how the later part of it is structured , the Ip
> addresses are my name - data points which I wanted to get clustered,
> what do those vector values mean , if they mean the vectors of those
> points , I am not sure why they are only 2 dimensional as my original
> data points were consisting of 288 dimensions , for each ip address.
>
> Thanks for all the help,
> Abhik
>
>
