question about clustering

Walter Chang Sun, 02 Oct 2011 20:53:31 -0700

Hi,

I have used Mahout to run k-means clustering on my TF-IDF vectors. I ran it
from the Mahout command line, and the job appears to complete successfully:


$MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters \
    -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000

Two cluster directories were generated (cluster-1 and cluster-2). When I run
clusterdump on each of them, the top terms of the clusters look the same to
me. Any idea why?

Also, how can I see which documents have been assigned to each cluster?
Right now I can see the number of documents assigned, but not the complete
list.

Most importantly, for production purposes, I assume it makes sense to always
run k-means on Hadoop to generate the clustering output. But how do I consume
that output at serving time? Ideally, a doc id or query would be passed to
the server, and the server would return the top document, ranked by score,
from the same cluster. How do I do this in code? Are there any good examples?
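
To make the serving idea concrete, here is a rough sketch of the lookup I have in mind. It is not Mahout code: it assumes the cluster assignments have already been exported from Hadoop as (doc id, cluster id, score) triples, and the function names, the sample data and the meaning of "score" are all hypothetical.

```python
from collections import defaultdict


def build_index(assignments):
    """Build lookup tables from (doc_id, cluster_id, score) triples.

    The triples are assumed to come from an offline export of the
    clustering output; that export step is not shown here.
    """
    doc_to_cluster = {}
    cluster_to_docs = defaultdict(list)
    for doc_id, cluster_id, score in assignments:
        doc_to_cluster[doc_id] = cluster_id
        cluster_to_docs[cluster_id].append((score, doc_id))
    # Sort each cluster once, highest score first (ties broken by doc id).
    for docs in cluster_to_docs.values():
        docs.sort(key=lambda pair: (-pair[0], pair[1]))
    return doc_to_cluster, dict(cluster_to_docs)


def top_in_same_cluster(doc_id, doc_to_cluster, cluster_to_docs, n=1):
    """Return up to n other doc ids from doc_id's cluster, best score first."""
    cluster_id = doc_to_cluster[doc_id]
    ranked = cluster_to_docs[cluster_id]
    return [other for _, other in ranked if other != doc_id][:n]


# Hypothetical sample data, only to show the call pattern.
sample = [("d1", 0, 0.90), ("d2", 0, 0.70), ("d3", 1, 0.80), ("d4", 0, 0.95)]
doc_to_cluster, cluster_to_docs = build_index(sample)
print(top_in_same_cluster("d2", doc_to_cluster, cluster_to_docs, n=2))
```

The index would be built once when the server loads the clustering output, so each request is only a dictionary lookup plus a scan of one pre-sorted cluster.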

Thanks a lot,

Weide
