On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:

> Hi,
> 
> I have used Mahout to produce a k-means clustering of my tf-idf result. I used
> the mahout command line to produce the clusters, and it seems to complete
> successfully.
> 
> $MAHOUT_HOME/bin/mahout kmeans  -i ./tfidf-vectors -c ./initialclusters -o
> ./kmeans-clusters  -cd 1.0 -k 3 -x 1000
> 
> It seems two cluster directories are generated (cluster-1 and cluster-2).
> When I use clusterdump on each of them, it seems to me that the clustered
> top terms are the same. Any idea why?

The top terms are exactly that: the highest-weighted terms of each cluster, not
all of its terms. The numbered directories are the output of successive
iterations, so my guess is that the centroids don't change much between the two.
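
To illustrate that guess, here is a minimal Python sketch. The two centroids are made-up term weights (not output from the run above), and top_terms is a hypothetical helper, not a Mahout API. It only shows that the weights can shift between iterations while the ordering of the heaviest terms stays the same.

```python
def top_terms(centroid, n=3):
    """Return the n highest-weighted terms of a centroid (term -> weight).

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(centroid.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical centroid of the same cluster after two successive iterations.
iteration_1 = {"hadoop": 0.82, "cluster": 0.61, "vector": 0.40,
               "lucene": 0.12, "index": 0.05}
iteration_2 = {"hadoop": 0.79, "cluster": 0.64, "vector": 0.38,
               "lucene": 0.15, "index": 0.07}

# The weights differ, but the top terms (and their order) do not.
same_top = top_terms(iteration_1) == top_terms(iteration_2)
weights_changed = iteration_1 != iteration_2
```

If the dumps looked like this, identical top terms in both directories would be expected rather than a sign of a problem.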

> 
> Also, how can I see which documents have been assigned to each cluster?
> Right now, I can see the number of documents assigned but not the complete
> list.

Add the --clustering flag.  By default, K-Means just calculates the centroids.  
If you want to know membership, the --clustering flag does that.

> 
> Most importantly, for production purposes, I assume it makes sense for
> kmeans to always run on Hadoop to generate the clustering file. But how do I
> consume the results during serving? Ideally, the server would take a doc id or
> a query as input and return the top documents from the same cluster, ranked
> by score. How do I do that in code?
> Any good examples?

Presumably, you have to load up the centroids and/or the results and see which 
cluster the new item belongs to.
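
A rough sketch of that idea in Python, under several assumptions: the centroids, cluster membership, and document vectors have already been exported from the job output into plain dicts (the export step is not shown), cosine distance is used for scoring (use whatever distance measure the clustering was run with), and every name below is hypothetical rather than part of Mahout.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as term -> weight."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_cluster(vector, centroids):
    """Return the id of the centroid closest to the given vector."""
    return min(centroids, key=lambda cid: cosine_distance(vector, centroids[cid]))


def rank_in_cluster(query, centroids, members, doc_vectors, top_n=5):
    """Find the query's cluster, then rank that cluster's documents by distance."""
    cid = nearest_cluster(query, centroids)
    scored = [(doc_id, cosine_distance(query, doc_vectors[doc_id]))
              for doc_id in members[cid]]
    scored.sort(key=lambda pair: pair[1])  # closest first
    return cid, [doc_id for doc_id, _ in scored[:top_n]]


# Made-up data standing in for the exported clustering results.
centroids = {0: {"hadoop": 0.9, "cluster": 0.4},
             1: {"lucene": 0.8, "index": 0.6}}
doc_vectors = {"doc-a": {"hadoop": 1.0},
               "doc-b": {"hadoop": 0.5, "cluster": 0.5},
               "doc-c": {"lucene": 1.0}}
members = {0: ["doc-a", "doc-b"], 1: ["doc-c"]}

query = {"hadoop": 1.0, "cluster": 0.2}
cluster_id, ranked = rank_in_cluster(query, centroids, members, doc_vectors)
```

For a lookup by doc id instead of a free-text query, the same function could be called with that document's stored vector as the query. With many clusters or large membership lists, the linear scans here would likely need an index, but the flow stays the same: pick the cluster first, then score only its members.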



> 
> Thanks a lot,
> 
> Weide

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
