Re: why so hard to get doc->cluster mapping?

Jeff Eastman Wed, 18 Apr 2012 14:53:43 -0700

Are you running seq2sparse in there somewhere? It has a -nv option that will produce NamedVectors in its vector output. These will pass through the clustering and be evident in the clusterdump output.
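
Once the vectors carry names, the document-to-cluster mapping can be pulled out of a text dump of the clusteredPoints directory (Mahout's seqdumper utility is one way to produce such a dump). The following is only a rough sketch, not part of Mahout: the function name doc_to_cluster is made up for illustration, and the line layout it expects ("Key: <clusterId>: Value: <weight>: <docName>" followed by ":" or "=" and the vector body) is an assumption that may differ between Mahout versions, so check it against your own dump and adjust the pattern.

```python
import re

# ASSUMED line layout of a text dump of clusteredPoints (verify on your data):
#   Key: <clusterId>: Value: <weight>: <docName>:{<index>:<value>,...}
# or, with a different separator and bracket style:
#   Key: <clusterId>: Value: <weight>: <docName> = [<index>:<value>, ...]
LINE_RE = re.compile(
    r"^Key:\s*(?P<cluster>\d+):\s*Value:\s*(?P<weight>[^:]+):\s*"
    r"(?P<doc>[^:{\[=]+?)\s*[:=]\s*[\[{]"
)


def doc_to_cluster(lines):
    """Return {document name: cluster id} for every line that matches.

    Lines that do not match the assumed layout (headers, counts, or
    vectors that carry no name) are skipped silently.
    """
    mapping = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            mapping[match.group("doc")] = int(match.group("cluster"))
    return mapping


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python doc_to_cluster.py points.txt
    with open(sys.argv[1]) as handle:
        for doc, cluster in sorted(doc_to_cluster(handle).items()):
            print(f"{doc}\t{cluster}")
```

If the vectors were written without names, no line will match and the result is empty, which is a quick way to confirm whether the names made it through the clustering step.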


On 4/18/12 3:08 PM, Robert Stewart wrote:

I am running kmeans clustering on vectors extracted from a lucene index.


What I want as my end result is a mapping of document ID to the cluster for 
each document.  How can I get that output?  I see many other people also want 
this but I don't see enough detail in any solution that helps me enough to get 
it.

So far I do this:

./mahout lucene.vector -d ~/clusterdemo/solr/data/index/ -f text --idField id \
    --output output.txt --dictOut dict.txt

./mahout kmeans -i output.txt -o kmeans -x 10 -k 100 -ow --clusters clusters -cl

./mahout clusterdump --dictionary dict.txt --seqFileDir kmeans/clusters-10-final \
    --dictionaryType text --pointsDir kmeans/clusteredPoints --output dump

But what I see inside "dump" file does not contain any mapping from document ID 
to each cluster.  How can I get that?  Should not be this hard to get the most 
obvious/useful output IMO ;)

Thanks
Bob
