Hi, I have used Mahout to run kmeans clustering on my tf-idf vectors. I ran it from the Mahout command line, and it appears to complete successfully:
$MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000

Two cluster directories were generated (cluster-1 and cluster-2). When I run clusterdump on each of them, the top terms of the clusters look the same. Any idea why?

Also, how can I see which documents have been assigned to each cluster? Right now I can see the number of documents assigned, but not the complete list.

Most importantly, for production purposes, I assume it makes sense for kmeans to always run on Hadoop to generate the clustering file. But how do I consume the results at serving time? Ideally, the server would take a doc id or a query as input and return the top document, ranked by score, from the same cluster. How do I do this in code? Are there any good examples?

Thanks a lot,
Weide
