Mahesh, I guess this is what I get for working too long and not recognizing the diff. I suspected it was something silly.
Changing the driver parameters to EXACTLY the same as the command line does indeed work. Thank you. I now have one file.

Not sure if it was the convergence or the sequential, but I have a hunch that the problem was the sequential (as you pointed out, I have plenty of iterations left).

Cheers!

SCott

On 1/6/14 3:58 AM, "Mahesh Balija" <[email protected]> wrote:

>Hi Scott,
>
>Not very sure why you are getting many part files in code execution. The
>difference between your command line and the code execution is that your cd
>[Convergence Delta] is different, 0.1 and 0.01. In the latter case KMeans
>might take more iterations to converge, as its convergenceDelta is very
>small, but anyway you have the number of iterations set to 10.
>Another difference is that you are running your source code execution in
>sequential mode. I am not sure whether these factors really affect the
>number of part files being generated.
>
>Anyhow, you have to evaluate the number of clusters finally generated
>by using ClusterDumper in both cases; that will give you the number of
>clusters and the points associated with each cluster.
>
>The ClusteredPoints will be generated in the last iteration and will have
>the info about the clusters and associated points for each cluster.
>
>Best,
>Mahesh Balija.
>
>
>On Sun, Jan 5, 2014 at 1:59 AM, Scott C. Cote <[email protected]>
>wrote:
>
>> All,
>>
>> When I run the Kmeans analysis from the command line,
>>
>> > #
>> > # added the -cd option per instructions in the Mahout In Action (MiA) so
>> > # the convergence threshold is .1
>> > # instead of default value of .5 because cosines lie within 0 and 1.
>> > #
>> > # maximum number of iterations is 10
>> > #
>> > mahout kmeans -i reuters-vectors/tfidf-vectors/ -c
>> > reuters-canopy-centroids/clusters-0-final/ -cl -ow -o reuters-kmeans-clusters
>> > -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
>>
>> the iterations resolve to a directory with the word "final" that has a
>> single file with a name like "part-r-00000".
>> If I run it as a Java routine:
>>
>> KMeansDriver.run(conf, vectorsFolder, new Path(canopyCentroids,
>> "clusters-0-final"), clusterOutput,
>> new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);
>>
>> thousands of files such as "part-00338" are produced. The same data is
>> used as input for both, and both are initialized from canopy.
>>
>> Why does the command line form generate a single file while my Java version
>> generates multiple output files? What setting/configuration am I missing?
>>
>> Secondary question: I assume the sequence files located in the "final"
>> folder contain the "centroids" of the data, and the points that the
>> centroids were derived from are in "clusteredPoints" (please confirm).
>>
>> Thanks in advance.
>>
>> SCott
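
On the convergence delta point raised in the thread, here is a minimal sketch of how the threshold and the iteration cap interact. It is a toy one-dimensional k-means written only for illustration, not Mahout's implementation: the class name ConvergenceDeltaSketch, the method iterationsToConverge, the sample points and the starting centroids are all hypothetical, and it uses plain absolute difference rather than CosineDistanceMeasure.

```java
import java.util.Arrays;

public class ConvergenceDeltaSketch {

    /**
     * Toy 1-D k-means (hypothetical helper, not part of Mahout).
     * Returns the number of iterations executed. The loop stops as soon as
     * no centroid moved more than convergenceDelta, or when maxIterations
     * is reached, whichever comes first.
     */
    public static int iterationsToConverge(double[] points, double[] initialCentroids,
                                           double convergenceDelta, int maxIterations) {
        double[] centroids = Arrays.copyOf(initialCentroids, initialCentroids.length);
        for (int iteration = 1; iteration <= maxIterations; iteration++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];

            // Assignment step: each point goes to its nearest centroid.
            for (double p : points) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) {
                        nearest = c;
                    }
                }
                sums[nearest] += p;
                counts[nearest]++;
            }

            // Update step: move each centroid to the mean of its points
            // and track the largest movement seen in this iteration.
            double largestShift = 0.0;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous centroid
                }
                double updated = sums[c] / counts[c];
                largestShift = Math.max(largestShift, Math.abs(updated - centroids[c]));
                centroids[c] = updated;
            }

            if (largestShift <= convergenceDelta) {
                return iteration; // converged under the given threshold
            }
        }
        return maxIterations; // cap reached before convergence
    }

    public static void main(String[] args) {
        double[] points = {0, 1, 2, 10, 11, 12};
        double[] start = {0, 1};
        System.out.println("loose delta: " + iterationsToConverge(points, start, 4.0, 10));
        System.out.println("tight delta: " + iterationsToConverge(points, start, 0.1, 10));
        System.out.println("tight delta, cap 2: " + iterationsToConverge(points, start, 0.1, 2));
    }
}
```

On this sample data the loose threshold stops after 2 iterations and the tight one after 3, while a cap of 2 cuts the tight run short. That matches the reasoning in the thread: a smaller delta can only ask for more iterations, and the maximum iteration count bounds the run either way. It says nothing about how many part files get written.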
