On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:

> Hi ,
>
> i have used mahout to produce kmeans clustering for my tf-idf result. I use
> the mahout command line to produce the clusters and it seems it successfully
> completes.
>
> $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000
>
> It seems there are two clusters directory generated.(cluster-1 and
> cluster-2) , when i use clusterdump on each of them, it seems to me that
> the clustered top terms are the same. Any idea why ?
The top terms are exactly that, the top terms. They are not all of the terms. My guess is that things don't change much between the two iterations.

>
> Also, how can i see which documents have been assigned to each cluster.
> Right now, i can see the number of documents assigned but not the complete
> list.

Add the --clustering flag. By default, K-Means just calculates the centroids. If you want to know membership, the --clustering flag does that.

>
> Most importantly, for production purposes, i assume it makes sense for
> kmeans always runs on hadoop to generate the clustering file. But how do i
> consume these during serving ? Ideally, serving should have the doc id or
> query passed as a query, and the server should return the top document
> ranked by the score within the same cluster back. How do I do it in code ?
> Any good examples ?

Presumably, you have to load up the centroids and/or the results and see which cluster the new item belongs to.

>
> Thanks a lot,
>
> Weide

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
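As a rough illustration of the serving-side idea above (load the centroids, then see which cluster a new item belongs to), here is a minimal sketch in plain Java. It does not use the Mahout API: NearestCentroid, squaredDistance and assign are hypothetical names, the centroids are assumed to be already loaded into dense double arrays, and squared Euclidean distance is an assumption. Use whichever distance measure the clustering job actually ran with.

```java
// Hypothetical helper, not part of Mahout: assigns a vector to its nearest centroid.
final class NearestCentroid {

    // Squared Euclidean distance between two dense vectors of equal length.
    // (Assumed measure; swap in the one used when the clusters were built.)
    static double squaredDistance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the index of the centroid closest to the vector.
    // Ties go to the lowest index.
    static int assign(double[][] centroids, double[] vector) {
        if (centroids.length == 0) {
            throw new IllegalArgumentException("no centroids");
        }
        int best = 0;
        double bestDist = squaredDistance(centroids[0], vector);
        for (int i = 1; i < centroids.length; i++) {
            double dist = squaredDistance(centroids[i], vector);
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up centroids and query vector, for illustration only.
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}, {0.0, 10.0}};
        double[] query = {9.0, 8.0};
        System.out.println(assign(centroids, query)); // prints 1
    }
}
```

Once the cluster index is known, the server could restrict ranking to the documents recorded as members of that cluster (the membership output produced with --clustering).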
