Are you running seq2sparse in there somewhere? It has a -nv option that will produce NamedVectors in its vector output. These will pass through the clustering and be evident in the clusterdump output.
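
Once the names are in the vectors, the remaining step is to turn the clustered points into a document-to-cluster table. As far as I know, the records under clusteredPoints are keyed by cluster ID and each value wraps the clustered vector, so with NamedVectors each record is effectively a (cluster ID, document ID) pair. Here is a minimal sketch of that last step only. It assumes the pairs have already been read out of the sequence files by some other means; the function names and the sample IDs are made up for illustration and are not part of Mahout.

```python
from collections import defaultdict


def doc_to_cluster(assignments):
    """Build a {document ID: cluster ID} mapping.

    `assignments` is any iterable of (cluster_id, doc_id) pairs, i.e. the
    assumed shape of the clusteredPoints records once the vector name has
    been pulled out of each value.
    """
    mapping = {}
    for cluster_id, doc_id in assignments:
        mapping[doc_id] = cluster_id
    return mapping


def cluster_sizes(mapping):
    """Count how many documents landed in each cluster."""
    sizes = defaultdict(int)
    for cluster_id in mapping.values():
        sizes[cluster_id] += 1
    return dict(sizes)


# Hypothetical records, for illustration only.
sample = [(12, "doc-001"), (12, "doc-007"), (45, "doc-002")]

mapping = doc_to_cluster(sample)
for doc_id in sorted(mapping):
    # One tab-separated line per document: document ID, then cluster ID.
    print(f"{doc_id}\t{mapping[doc_id]}")
```

The mapping is keyed by document ID because k-means assigns each point to exactly one cluster, so a plain dictionary is enough.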

On 4/18/12 3:08 PM, Robert Stewart wrote:
I am running kmeans clustering on vectors extracted from a lucene index.

What I want as my end result is a mapping of document ID to the cluster for 
each document.  How can I get that output?  I see many other people also want 
this but I don't see enough detail in any solution that helps me enough to get 
it.

So far I do this:

./mahout lucene.vector -d ~/clusterdemo/solr/data/index/ -f text --idField id \
    --output output.txt --dictOut dict.txt

./mahout kmeans -i output.txt -o kmeans -x 10 -k 100 -ow --clusters clusters -cl

./mahout clusterdump --dictionary dict.txt \
    --seqFileDir kmeans/clusters-10-final --dictionaryType text \
    --pointsDir kmeans/clusteredPoints --output dump

But what I see inside the "dump" file does not contain any mapping from document 
ID to cluster.  How can I get that?  It should not be this hard to get the most 
obvious/useful output IMO ;)

Thanks
Bob



