Re: need help explaining difference in k means output

Scott C. Cote Mon, 06 Jan 2014 06:54:57 -0800

Mahesh,

I guess this is what I get for working too long and not recognizing the
difference. I suspected it was something silly.


Changing the driver parameters to EXACTLY the same as the command line
does indeed work.   Thank you.
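
For reference, the settings that differed between the two runs can be laid
side by side. The sketch below is plain Java and does not call Mahout. The
Params holder and the diff helper are hypothetical names made up for
illustration, and the reading of 20 as the iteration cap and of the last
"true" as the run-sequential flag is an assumption based on Mahesh's
description, not a checked signature.

```java
import java.util.ArrayList;
import java.util.List;

class KMeansParamDiff {

    // Hypothetical holder for the settings discussed in this thread (not a Mahout class).
    static final class Params {
        final double convergenceDelta;
        final int maxIterations;
        final boolean runSequential;

        Params(double convergenceDelta, int maxIterations, boolean runSequential) {
            this.convergenceDelta = convergenceDelta;
            this.maxIterations = maxIterations;
            this.runSequential = runSequential;
        }
    }

    // Returns one line per setting whose value differs between the two runs.
    static List<String> diff(Params cli, Params code) {
        List<String> out = new ArrayList<>();
        if (cli.convergenceDelta != code.convergenceDelta) {
            out.add("convergenceDelta: " + cli.convergenceDelta + " vs " + code.convergenceDelta);
        }
        if (cli.maxIterations != code.maxIterations) {
            out.add("maxIterations: " + cli.maxIterations + " vs " + code.maxIterations);
        }
        if (cli.runSequential != code.runSequential) {
            out.add("runSequential: " + cli.runSequential + " vs " + code.runSequential);
        }
        return out;
    }

    public static void main(String[] args) {
        // Command line: -cd 0.1 -x 10, no sequential option given (assumed to mean MapReduce).
        Params cli = new Params(0.1, 10, false);
        // Java call: 0.01, 20, ..., true (assumed order: delta, max iterations, sequential).
        Params code = new Params(0.01, 20, true);
        for (String line : diff(cli, code)) {
            System.out.println(line);
        }
    }
}
```

Under those assumptions all three settings differ, so any one of them could
in principle explain a difference in output layout.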

I now have one file.  I am not sure whether the fix was the convergence
delta or the sequential flag, but my hunch is that the problem was the
sequential flag (as you pointed out, I have plenty of iterations left).
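
To illustrate why the convergence delta alone could change the iteration
count, here is a minimal sketch in plain Java. It assumes the usual k-means
rule that a centroid counts as converged once it moves by no more than the
delta between iterations, measured here with cosine distance. The class,
the method names, and the sample vectors are hypothetical, and Mahout's own
check may differ in detail.

```java
class ConvergenceDeltaSketch {

    // Cosine distance: 1 minus the cosine of the angle between a and b.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed rule: converged when the centroid moved by at most delta.
    static boolean converged(double[] oldCentroid, double[] newCentroid, double delta) {
        return cosineDistance(oldCentroid, newCentroid) <= delta;
    }

    public static void main(String[] args) {
        // Made-up centroid positions before and after one iteration.
        double[] oldCentroid = {1.0, 0.0};
        double[] newCentroid = {0.95, 0.3};

        System.out.println("delta 0.1:  " + converged(oldCentroid, newCentroid, 0.1));
        System.out.println("delta 0.01: " + converged(oldCentroid, newCentroid, 0.01));
    }
}
```

With a delta of 0.1 this shift counts as converged; with 0.01 it does not,
so the tighter setting keeps iterating until the iteration cap is reached.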

Cheers!

SCott

On 1/6/14 3:58 AM, "Mahesh Balija" <[email protected]> wrote:

>Hi Scott,
>
>I am not sure why you are getting so many part files from the code
>execution. One difference between your command line and your code is the
>cd [Convergence Delta]: 0.1 versus 0.01. In the latter case KMeans might
>take more iterations to converge, because its convergenceDelta is much
>smaller, but in any case you have the number of iterations set to 10.
>Another difference is that you are running your code in sequential mode.
>I am not sure whether these factors really affect the number of part
>files being generated.
>
>In any case, you should check the final number of clusters in both cases
>by using ClusterDumper. That will give you the number of clusters and the
>points associated with each cluster.
>
>The clusteredPoints output is generated in the last iteration and holds
>the clusters and the points associated with each cluster.
>
>Best,
>Mahesh Balija.
>
>
>On Sun, Jan 5, 2014 at 1:59 AM, Scott C. Cote <[email protected]>
>wrote:
>
>> All,
>>
>> When I run the Kmeans analysis from the command line,
>>
>> > #
>> > # added the -cd option per instructions in Mahout in Action (MiA) so
>> > # the convergence threshold is .1 instead of the default value of .5,
>> > # because cosines lie within 0 and 1.
>> > #
>> > # maximum number of iterations is 10
>> > #
>> > mahout kmeans -i reuters-vectors/tfidf-vectors/ \
>> >   -c reuters-canopy-centroids/clusters-0-final/ \
>> >   -cl -ow -o reuters-kmeans-clusters -x 10 \
>> >   -dm org.apache.mahout.common.distance.CosineDistanceMeasure -cd 0.1
>>
>> the iterations resolve to a directory with the word "final" in its name
>> that holds a single file named like "part-r-00000".
>> If I run it as a Java routine:
>>
>> KMeansDriver.run(conf, vectorsFolder,
>>     new Path(canopyCentroids, "clusters-0-final"), clusterOutput,
>>     new CosineDistanceMeasure(), 0.01, 20, true, 0.0, true);
>>
>>
>>
>> thousands of files such as "part-00338" are produced. The same data is
>> used as input for both, and both are initialized from canopy.
>>
>> Why does the command-line form generate a single file while my Java
>> version generates multiple output files? What setting/configuration am I
>> missing?
>>
>> Secondary question: I assume the sequence files located in the "final"
>> folder contain the "centroids" of the data, and that the points the
>> centroids were derived from are in "clusteredPoints" (please confirm).
>>
>> Thanks in advance.
>>
>> SCott
>>
>>
>>
>>
>>
