All,

When I run the k-means analysis from the command line,

> #
> # Added the -cd option per the instructions in Mahout in Action (MiA), so the
> # convergence threshold is 0.1 instead of the default value of 0.5, because
> # cosine distances lie between 0 and 1.
> #
> # The maximum number of iterations is 10.
> #
> mahout kmeans -i reuters-vectors/tfidf-vectors/ \
>   -c reuters-canopy-centroids/clusters-0-final/ \
>   -cl -ow -o reuters-kmeans-clusters \
>   -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1

the iterations finish with a directory whose name contains the word "final",
holding a single file with a name like "part-r-00000".

If I run it as a Java routine:

KMeansDriver.run(conf, vectorsFolder,
    new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
    new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);



thousands of files such as "part-00338" are produced. The same data is
used as input for both runs, and both are initialized from the canopy
centroids.

Why does the command-line form generate a single file while my Java version
generates multiple output files? What setting or configuration am I missing?

Secondary question: I assume the sequence files in the "final" folder
contain the centroids of the data, and that the points the centroids were
derived from are in the "clusteredPoints" folder. Please confirm.

Thanks in advance.

SCott



