All,

When I run the k-means analysis from the command line,

    #
    # Added the -cd option per the instructions in Mahout in Action (MiA), so the
    # convergence threshold is 0.1 instead of the default value of 0.5, because
    # cosines lie between 0 and 1.
    #
    # The maximum number of iterations is 10.
    #
    mahout kmeans -i reuters-vectors/tfidf-vectors/ \
        -c reuters-canopy-centroids/clusters-0-final/ \
        -cl -ow -o reuters-kmeans-clusters \
        -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1

the iterations resolve to a directory with the word "final" in its name, which contains a single file named like "part-r-00000".

If I run it as a Java routine:

    KMeansDriver.run(conf, vectorsFolder,
        new Path(canopyCentroids, "clusters-0-final"),
        clusterOutput, new CosineDistanceMeasure(),
        0.01, 20, true, 0.0, true);

thousands of files with names such as "part-00338" are produced.

The same data is used as input for both runs, and both are initialized from canopy.

Why does the command-line form generate a single file while my Java version generates multiple output files? What setting or configuration am I missing?

Secondary question: I assume the sequence files located in the "final" folder contain the centroids of the data, and that the points the centroids were derived from are in the "clusteredPoints" folder. Please confirm.

Thanks in advance.

Scott
