Re: need help explaining difference in k means output

Mahesh Balija Mon, 06 Jan 2014 01:59:58 -0800

Hi Scott,

I'm not sure why the code execution produces so many part files. One
difference between your command line and your code is the convergence
delta (cd): 0.1 on the command line versus 0.01 in the code. With the
smaller delta, KMeans may need more iterations to converge, although the
iteration cap (10 on the command line, 20 in your code) limits that either
way.
Another difference is that your code runs in sequential mode. I am not sure
whether either of these factors really affects the number of part files
being generated.
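
To illustrate the point about the convergence delta and the iteration cap,
here is a minimal sketch of a toy 1-D k-means loop in plain Java. It is not
Mahout's implementation: the class name, the method name and the sample data
are made up for illustration, and the stopping rule (stop once no centroid
moved by more than the delta, or once the cap is reached) is my assumption of
how the two settings interact.

```java
import java.util.Arrays;

// Toy 1-D k-means used only to show how a convergence delta and an
// iteration cap interact. Hypothetical sketch, not Mahout code.
class ConvergenceDeltaSketch {

    // Returns how many iterations ran before the loop stopped.
    static int iterationsUntilStop(double[] points, double[] initialCentroids,
                                   double convergenceDelta, int maxIterations) {
        double[] centroids = Arrays.copyOf(initialCentroids, initialCentroids.length);
        int iterations = 0;
        while (iterations < maxIterations) {
            iterations++;
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            // Assignment step: attach each point to its nearest centroid.
            for (double p : points) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) {
                        nearest = c;
                    }
                }
                sums[nearest] += p;
                counts[nearest]++;
            }
            // Update step: move each centroid to the mean of its points and
            // check whether any centroid moved by more than the delta.
            boolean converged = true;
            for (int c = 0; c < centroids.length; c++) {
                double updated = counts[c] == 0 ? centroids[c] : sums[c] / counts[c];
                if (Math.abs(updated - centroids[c]) > convergenceDelta) {
                    converged = false;
                }
                centroids[c] = updated;
            }
            if (converged) {
                break;
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        double[] points = {0.0, 0.2, 0.4, 5.0, 5.2, 9.0, 9.5, 10.0};
        double[] start = {0.0, 0.2, 0.4};
        System.out.println("cd=0.1,  cap=10: "
                + iterationsUntilStop(points, start, 0.1, 10) + " iterations");
        System.out.println("cd=0.01, cap=20: "
                + iterationsUntilStop(points, start, 0.01, 20) + " iterations");
    }
}
```

With the same starting centroids, the smaller delta can never stop earlier
than the larger one, and neither run can exceed its cap. Nothing in this loop
touches how the output is split into files, which is why I doubt the delta
alone explains the part-file count.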


Either way, you should check the final number of clusters in both cases
with ClusterDumper. It will give you the clusters and the points associated
with each cluster.

The clusteredPoints output is generated in the last iteration and contains
each cluster along with the points assigned to it.

Best,
Mahesh Balija.


On Sun, Jan 5, 2014 at 1:59 AM, Scott C. Cote <[email protected]> wrote:

> All,
>
> When I run the Kmeans analysis from the command line,
>
> > #
> > # added the -cd option per instructions in Mahout in Action (MiA) so the
> > # convergence threshold is .1 instead of the default value of .5,
> > # because cosines lie within 0 and 1.
> > #
> > # maximum number of iterations is 10
> > #
> > mahout kmeans -i reuters-vectors/tfidf-vectors/ \
> >   -c reuters-canopy-centroids/clusters-0-final/ -cl -ow \
> >   -o reuters-kmeans-clusters -x 10 \
> >   -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
>
>  the iterations resolve to a directory with the word "final" that has a
> single file where the name is like "part-r-00000"  .
>  If I run it as a java routine:
>
> KMeansDriver.run(conf, vectorsFolder,
>     new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
>     new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);
>
>
>
>  thousands of files such as "part-00338"  are produced.  The same data is
> used as input for both and both are initialized from canopy .
>
> Why does the command line form generate a single file while my Java version
> generates multiple output files?  What setting/configuration am I missing?
>
> Secondary question:  I assume the sequence files located in the "final"
> folder contain the "centroids" of the data, and that the points the
> centroids were derived from are in "clusteredPoints". Please confirm.
>
> Thanks in advance.
>
> Scott
>
>
>
>
>
