It is possible with kmeans to get into a situation where the centers oscillate and never converge, and with your dataset that may be the case. I suggest continuing to get your input data into proper shape and revisiting this if it occurs again.
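For reference, the -cd test essentially asks whether any center still moved by more than the convergence delta in the last iteration; a cluster whose center keeps moving (or oscillating) stays marked CL- rather than VL- in the dump. A minimal sketch of that idea in plain Java with Euclidean distance, as in your runs (the class name and numbers are made up for illustration; this is not Mahout's actual code):

// Toy sketch of the kind of convergence test that -cd controls: a cluster
// counts as converged once its center stops moving by more than the delta.
// Plain Java with Euclidean distance; not Mahout's actual implementation.
public class ConvergenceSketch {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // True only if every center moved by less than convergenceDelta.
    static boolean allConverged(double[][] before, double[][] after, double convergenceDelta) {
        for (int i = 0; i < before.length; i++) {
            if (euclidean(before[i], after[i]) > convergenceDelta) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up centers for two consecutive iterations: the first barely
        // moves, the second keeps jumping around (an oscillating center).
        double[][] before = { {1.98, 1.77, 2.41}, {286.0, 238.4, 280.1} };
        double[][] after  = { {1.97, 1.78, 2.40}, {302.3, 122.7, 26.0} };
        System.out.println("-cd 0.01 converged? " + allConverged(before, after, 0.01));
        System.out.println("-cd 0.8  converged? " + allConverged(before, after, 0.8));
        // Both print false: raising -cd a little does not help when a center
        // really is still moving between iterations.
    }
}

Raising -cd loosens the test and raising -x allows more iterations for the centers to settle, but neither helps much if a center genuinely oscillates between two positions.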
-----Original Message-----
From: Abhik Banerjee [mailto:[email protected]]
Sent: Tuesday, August 02, 2011 11:54 AM
To: [email protected]
Subject: Re: Analyzing the clusterdump output - kmeans clustering

I tried increasing the value of the -cd option from 0.01 to 0.5 and then to 0.8, and also increased the number of iterations from 15 to 20 and then to 40. I still get the same number of CL- and VL- clusters and they do not seem to converge. Does that mean these are the best clusters I can get and they cannot converge beyond this point? I shall be thankful for all the help.

Abhik

Abhik Banerjee <banerjee.abhik.hcl <at> gmail.com> writes:
>
> Hi Jeff,
>
> Thanks a lot, I have been waiting for the reply. I feel you are right: I did not notice that the clusterdump output from canopy also has a cluster-0 in its output file, so that part is fine, and I think you are also right about the points gathering in cluster-0, because that is probably how the data should behave.
>
> I have around 288 dimensions, but only around 2-3 of them have non-zero values; the rest are 0, so the input data is fine. That is the input file I generated using MapReduce (this is just a prototype, and the final version might have non-zero values in most of the cells).
>
> Next I shall increase the iteration count and the -cd value so that they converge, but when I check now, I see that all of my clusterdump output for the canopy clusters gives n=1 in the cluster results:
>
> clusterdump --seqFileDir hdfs:/var/tmp/output_canopy_08_01_06_75_200/clusters-0/part-r-00000 --pointsDir hdfs:/var/tmp/output_canopy_08_01_06_75_200/clusteredPoints/ --output canopy_output_08_01/output_canopy_08_01_06_75_200.txt
>
> C-0{n=1 c=[186:2.570, 189:2.311, 212:2.844] r=[]}
> C-1{n=1 c=[186:10.125, 189:258.500, 212:3.125] r=[]}
> C-2{n=1 c=[186:483.500, 189:416.000, 212:25.500] r=[]}
> C-3{n=1 c=[186:309.000, 189:31.500] r=[]}
> C-4{n=1 c=[186:208.500, 189:227.050, 212:187.800] r=[]}
> C-5{n=1 c=[186:2.426, 189:2.754, 212:209.115] r=[]}
> C-6{n=1 c=[186:471.500, 189:403.000, 212:260.000] r=[]}
> C-7{n=1 c=[186:287.250, 189:299.500] r=[]}
> C-8{n=1 c=[186:311.000, 189:327.000, 212:382.000] r=[]}
> C-9{n=1 c=[186:227.000, 189:98.000, 212:356.000] r=[]}
>
> I feel I am doing something wrong in the clusterdump arguments, but I am not able to pinpoint it (kmeans clusterdump with the same arguments gives proper n= values).
>
> Thanks again for all your help.
>
> Abhik
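Incidentally, those clusterdump text files are easy to post-process when you want a quick count of how many clusters there are and how many points each holds. A rough tally in plain Java, assuming the C-/CL-/VL-{n=... c=[...] r=[...]} line format shown in these dumps; pass the dump file path as the first argument (the class name and regex are illustrative, not an official Mahout parser):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Rough tally of a clusterdump text file: how many clusters of each kind
// (C-/CL-/VL-) appear and how many points (n=...) they hold in total.
public class ClusterDumpTally {
    private static final Pattern HEADER =
            Pattern.compile("^(C|CL|VL)-(\\d+)\\{n=(\\d+)");

    public static void main(String[] args) throws Exception {
        Map<String, Integer> clusters = new HashMap<>();
        Map<String, Long> points = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                Matcher m = HEADER.matcher(line.trim());
                if (m.find()) {
                    String kind = m.group(1);
                    clusters.merge(kind, 1, Integer::sum);
                    points.merge(kind, Long.parseLong(m.group(3)), Long::sum);
                }
            }
        }
        for (String kind : clusters.keySet()) {
            System.out.println(kind + "-: " + clusters.get(kind)
                    + " clusters, " + points.get(kind) + " points");
        }
    }
}

On a kmeans dump it makes it immediately obvious how many clusters are still CL- versus VL- and how much of the data has landed in CL-0.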
> Jeff Eastman <jeastman <at> Narus.com> writes:
> >
> > Kmeans will not create new clusters beyond what it is given in the prior. When you ran canopy I suspect you actually got 12 clusters there too (run clusterdump on clusters-0 to verify). The CL clusters have not yet converged to VL clusters, so you either need a larger -cd value or you need to increase the number of iterations (-x) so they will converge. The fact that most of your points ended up in CL-0 indicates they were mostly alike. That's what clustering does :).
> >
> > I'm a little suspicious of your input data. You claim it has 288 dimensions, but all of your clusters show centroid values in the 186, 189 and 212 elements only. This suggests to me that the 285 other element values are all zero. I'd recheck your input vectors to make sure you are really getting the sort of input you desire.
> >
> > Other than this, it looks like you are on the right track.
> >
> > -----Original Message-----
> > From: Abhik Banerjee [mailto:banerjee.abhik.hcl <at> gmail.com]
> > Sent: Tuesday, August 02, 2011 10:01 AM
> > To: user <at> mahout.apache.org
> > Subject: Re: Analyzing the clusterdump output - kmeans clustering
> >
> > Hi,
> >
> > I am still having a few issues interpreting the result from the kmeans clusterdump output; I shall be thankful if someone can help me out with it. I first used canopy clustering, which gave me 11 centroid points, and then I used that file with the Mahout kmeans command to generate the output folder, running it for 8 iterations:
> >
> > mahout kmeans -i /var/tmp/input_kmeans_08_01_01/ipfull -o /var/tmp/output_kmeans_08_01_03_50_200/ -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 0.01 -c hdfs:/var/tmp/output_canopy_08_01_03_50_200/clusters-0/ -cl -x 8
> >
> > which gave me a folder with clusters-0 to clusters-8 and a folder named clusteredPoints, after which I used the clusterdump command and got the following result:
> >
> > mahout clusterdump --seqFileDir hdfs:/var/tmp/output_kmeans_08_01_04_75_250/clusters-8/part-r-00000 --pointsDir hdfs:/var/tmp/output_kmeans_08_01_04_75_250/clusteredPoints/ -o kmeans_output_08_01/output_kmeans_08_01_04_75_250.txt
> >
> > (My question is: I started with 11 centroid points, but here I see 12 CLs and VLs, including the CL-0 cluster. I am not able to understand why this results in 12 clusters when I only started with 11, and I also read somewhere on the forum that CL-0 and VL-0 differ in whether the points converged or not. I shall be thankful if someone can help me interpret the result. I tried the same thing varying the cluster centers and the number of cluster centers; each time I get an extra cluster center including cluster-0, and I also see that a large chunk of my points always lies in cluster-0 compared to the other clusters.)
> >
> > CL-0{n=245243 c=[186:1.979, 189:1.773, 212:2.412] r=[186:3.736, 189:3.471, 212:4.223]}
> > CL-1{n=7719 c=[186:23.854, 189:22.358, 212:19.325] r=[186:14.610, 189:15.440, 212:18.432]}
> > CL-10{n=15 c=[186:302.333, 189:122.733, 212:26.000] r=[186:61.250, 189:64.750, 212:43.909]}
> > CL-11{n=44 c=[186:176.568, 189:112.955, 212:191.932] r=[186:41.269, 189:37.998, 212:49.329]}
> > CL-2{n=43 c=[186:26.047, 189:196.395, 212:72.767] r=[186:29.565, 189:56.176, 212:61.852]}
> > CL-4{n=179 c=[186:123.464, 189:32.682, 212:9.380] r=[186:41.883, 189:24.915, 212:20.090]}
> > CL-5{n=273 c=[186:102.070, 189:102.099, 212:52.630] r=[186:37.351, 189:35.809, 212:40.499]}
> > CL-6{n=84 c=[186:193.810, 189:207.940, 212:171.179] r=[186:35.053, 189:39.931, 212:32.354]}
> > CL-8{n=113 c=[186:6.894, 189:8.841, 212:204.876] r=[186:20.501, 189:21.619, 212:70.855]}
> > CL-9{n=25 c=[186:286.000, 189:238.440, 212:280.120] r=[186:54.655, 189:57.882, 212:48.926]}
> > CL-7{n=8 c=[186:422.125, 189:350.750, 212:316.875] r=[186:48.486, 189:50.630, 212:59.742]}
> > [abanerjee <at> m0002006 ~]$ cat kmeans_output_08_01/output_kmeans_08_01_05_50_175.txt | grep VL-
> > VL-3{n=10 c=[186:378.800, 189:355.600, 212:12.600] r=[186:88.654, 189:61.557, 212:19.454]}
> >
> > Thanks all for your help.
> >
> > Abhik
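Jeff's diagnosis above is easy to reproduce on synthetic data: when kmeans is seeded with a fixed prior and most rows are near-zero vectors that differ only slightly in a couple of dimensions, the assignment step drops almost everything into the center nearest the origin and never invents new clusters. A small self-contained sketch in plain Java (made-up class name and synthetic data, not your actual vectors or Mahout code):

import java.util.Arrays;
import java.util.Random;

// Synthetic illustration: kmeans only shuffles points among the centers it
// was seeded with, and rows that are almost all zero (with small counts in a
// couple of dimensions) all fall into whichever center is nearest the origin.
public class AssignmentSketch {

    static double dist(double[] a, double[] b) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int dims = 288;                               // same dimensionality as the input
        double[][] centers = new double[3][dims];     // three seed centers, like a small prior
        centers[1][186] = 300.0; centers[1][189] = 300.0;  // a "heavy traffic" center
        centers[2][186] = 20.0;  centers[2][189] = 20.0;   // a "medium traffic" center
        // centers[0] stays at the origin, like the centroid of near-zero rows.

        int[] counts = new int[centers.length];
        for (int p = 0; p < 10000; p++) {
            double[] v = new double[dims];            // mostly zero, as in the real data
            v[186] = rnd.nextInt(5);
            v[189] = rnd.nextInt(5);
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (dist(v, centers[c]) < dist(v, centers[best])) best = c;
            }
            counts[best]++;
        }
        // Prints [10000, 0, 0]: no new clusters appear, and the near-identical
        // rows all collapse into the cluster closest to the origin.
        System.out.println(Arrays.toString(counts));
    }
}

That is the same effect as CL-0 above absorbing 245243 of the points while the remaining clusters stay tiny.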
> > Jeff Eastman <jeastman <at> Narus.com> writes:
> > >
> > > It is possible for kmeans to fail to assign any points to one of its clusters, and I believe this is the explanation for your seeing only 4 clusters when you requested 5. Also, looking at your clustered data, the points are printed out in sparse vector notation (index:value) and your input vectors only appear to have 2 or 3 nonzero elements. This could be the reason for dropping one of the clusters.
> > >
> > > -----Original Message-----
> > > From: Abhik Banerjee [mailto:banerjee.abhik.hcl <at> gmail.com]
> > > Sent: Friday, July 29, 2011 3:48 PM
> > > To: user <at> mahout.apache.org
> > > Subject: Analyzing the clusterdump output - kmeans clustering
> > >
> > > Hi,
> > >
> > > I managed to run the kmeans algorithm on a Cloudera VM, using the help provided at the wiki and on the forum. I got my output and am trying to use clusterdump to analyze the result.
> > >
> > > (I asked for 5 clusters with -k 5, but it seems to have formed only 4 clusters; I am also curious about that. I ran the command below.)
> > >
> > > mahout kmeans -i hdfs://localhost/mahout_input/ip -o hdfs://localhost/mahout_output/output_kmeans_07_29_1/ -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -cd 1.0 -c hdfs://localhost/mahout_input/centroids_07_29_1 -k 5 -x 5 -cl
> > >
> > > After kmeans completed on the Hadoop Cloudera VM I ran this command:
> > >
> > > mahout clusterdump --seqFileDir hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusters-5/part-r-00000 --pointsDir hdfs://localhost/mahout_output/output_kmeans_07_29_1/clusteredPoints --output kmeans_07_29_1_cl5.tx
> > >
> > > and when I look into the text file I see a structure like this:
> > >
> > > CL-99871{n=10157 c=[186:12.229, 189:9.343, 212:2.716] r=[186:7.803, 189:8.054, 212:4.686]}
> > > Weight:  Point:
> > > 1.0: 1.161.199.19 = [186:22.000, 189:32.000]
> > > 1.0: 1.161.204.226 = [186:9.000, 189:11.000]
> > > 1.0: 1.170.149.79 = [186:18.000, 189:10.000]
> > > 1.0: 1.175.137.84 = [186:23.000, 189:8.000]
> > > 1.0: 1.176.27.109 = [186:7.000, 189:9.000, 212:3.000]
> > > 1.0: 1.177.175.26 = [186:12.000, 189:12.000]
> > > 1.0: 1.197.208.25 = [186:26.000]
> > > 1.0: 1.212.176.27 = [186:11.000, 189:1.000]
> > > 1.0: 1.212.176.28 = [186:11.000, 189:6.000]
> > > 1.0: 1.22.160.35 = [186:17.000, 189:6.000]
> > > 1.0: 1.230.123.81 = [186:18.000, 189:4.000]
> > >
> > > I can figure out the first part of it, as explained in the wiki: the name is CL-99871, the number of points is 10157, the cluster center is the first [ ] vector, and the radius is the second [ ].
> > >
> > > I don't understand how the later part is structured. The IP addresses are my named data points, which I wanted to get clustered. What do those vector values mean? If they are the vectors of those points, I am not sure why they are only 2-dimensional, as my original data points consisted of 288 dimensions for each IP address.
> > >
> > > Thanks for all the help,
> > > Abhik
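To expand on the sparse notation point in Jeff's first reply: every point in that dump is still a 288-dimensional vector, the printer just lists the nonzero index:value pairs and omits the zeros, and the leading 1.0 appears to be the point's weight in the cluster (always 1.0 for plain kmeans assignments). A tiny sketch of the same idea in plain Java (the class and method names are made up for illustration):

import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// A 288-dimensional point with two nonzero entries prints the same way the
// clusterdump output does: only the index:value pairs that are not zero.
public class SparseNotation {

    static String sparseToString(Map<Integer, Double> nonZeros) {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (Map.Entry<Integer, Double> e : nonZeros.entrySet()) {
            if (!first) sb.append(", ");
            sb.append(e.getKey()).append(':')
              .append(String.format(Locale.ROOT, "%.3f", e.getValue()));
            first = false;
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        // The point named 1.161.199.19 from the dump above: 288 dimensions,
        // but only indices 186 and 189 carry a value.
        Map<Integer, Double> point = new TreeMap<>();
        point.put(186, 22.0);
        point.put(189, 32.0);
        System.out.println("1.161.199.19 = " + sparseToString(point));
        // prints: 1.161.199.19 = [186:22.000, 189:32.000]
    }
}

So the two or three numbers you see per IP address are not a reduced dimensionality; they are simply the only dimensions of that 288-element vector that are nonzero.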
