When you specify -k, the k-means driver randomly samples k points from your input data set to use as the initial cluster centers. Those centers are written to the -c directory and seed the first k-means iteration. Each iteration then produces a revised set of clusters in clusters-x; from your clusters-5 dump it looks like the computation has not yet converged (CL-21551 has not converged, whereas VL-21560 has). If you are running this against Reuters, I suspect your initial -k value is too low, and you will need to increase --maxIter to reach convergence.
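To make the initialize-then-iterate behavior concrete, here is a toy sketch of the same scheme (not Mahout's implementation) on 1-D points: sample k initial centers at random, reassign and recompute each iteration, and report convergence when no center moves more than a small epsilon. The function name and parameters are illustrative.

```python
import random

def kmeans(points, k, max_iter, eps=1e-6, seed=0):
    """Toy k-means: sample k initial centers from the data (analogous
    to Mahout's part-randomSeed), then iterate up to max_iter times."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    converged = False
    for _ in range(max_iter):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[i].append(p)
        # Recompute centers; an empty cluster keeps its old center.
        new = [sum(c) / len(c) if c else centers[i]
               for i, c in enumerate(clusters)]
        # Converged once every center moves less than eps.
        converged = all(abs(a - b) < eps for a, b in zip(centers, new))
        centers = new
        if converged:
            break
    return centers, converged
```

With --maxIter too small the loop runs out before the centers settle, which is exactly the CL- (not converged) vs VL- (converged) distinction in the dump.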

As for importing into Weka, the cluster dumper's output probably won't be of much use directly, so I suggest writing your own job to convert the clusters into a format Weka can consume.
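As a starting point for such a conversion, here is a rough sketch that parses clusterdump-style text lines (the format is inferred from the excerpt quoted below: a CL-/VL- prefix, n=, and sparse index:weight pairs) and emits the centers as a dense ARFF string for Weka. The function names are hypothetical and you would need to adapt the regex to your actual dump.

```python
import re

# Matches lines such as:  VL-21560{n=19722 c=[0:0.012, 3:0.5]}
LINE = re.compile(r'(CL|VL)-(\d+)\s*\{n=(\d+)\s+c\s*=?\s*\[([^\]]*)')

def parse_cluster(line):
    """Parse one dump line into id, converged flag, size, and center."""
    m = LINE.search(line)
    if not m:
        return None
    kind, cid, n, body = m.groups()
    vec = {}
    for pair in body.split(','):
        pair = pair.strip()
        if ':' in pair:
            idx, w = pair.split(':', 1)
            try:
                vec[int(idx)] = float(w)
            except ValueError:
                continue  # skip truncated "..." fragments
    return {'id': int(cid), 'converged': kind == 'VL',
            'n': int(n), 'center': vec}

def to_arff(clusters, dim, relation='mahout-clusters'):
    """Emit the cluster centers as a dense ARFF string Weka can load."""
    lines = ['@RELATION ' + relation]
    lines += ['@ATTRIBUTE term%d NUMERIC' % i for i in range(dim)]
    lines.append('@DATA')
    for c in clusters:
        lines.append(','.join(str(c['center'].get(i, 0.0))
                              for i in range(dim)))
    return '\n'.join(lines)
```

For real use you would read Mahout's sequence files directly rather than scrape the text dump, but this shows the shape of the conversion.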


On 8/30/10 11:20 AM, Valerio Ceraudo wrote:
In the clusters folder that I used there is this file: part-randomSeed


created by the command:

bin/mahout kmeans -i /home/vuvvo/reuters-out-seqdir-sparse/tfidf-vectors/ -c
/home/vuvvo/clusters -o /home/vuvvo/reuters-kmeans -k 3 --maxIter 5

Do I need to use the files in the reuters-kmeans folder? Inside it I have some
sub-directories called clusters-x, where x runs from 1 to 5.

I tried giving clusters-5 as input, and inside finalOutput I got a 1.4 MB file
that is very hard to open, even with 4 GB of RAM on a 64-bit machine ^^

I can read the first row:
CL-21551{n=1855 c=[1:0.011, 2:0.005, ... 31:0.012
and the second row:
VL-21560{n=19722 c=[0:0.012 etc etc...

Is it now converged and correct?

Is there a more comfortable way to read this file? Because then I need to convert
it into data for Weka.

Tonight I will try to convert ARFF data into an SGM file to see what I obtain
with it.
