Hi Grant,

I added the --clustering flag for kmeans, and I now see an output dir called ClusteredPoints. However, when I run the sequence file dump on it, what does each column mean? How do I associate each row with the original doc? The output is not straightforward to me.
Input Path: part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.WeightedPropertyVectorWritable
Key: 0: Value: wt: 1.0distance: 7.811960616283556 vec: [0:1.000, 11:1.847, 12:2.253, 14:1.847]
Key: 0: Value: wt: 1.0distance: 10.856925385759745 vec: [0:1.000, 5:2.253, 8:2.253, 11:1.847, 14:1.847]
Key: 0: Value: wt: 1.0distance: 10.174423474410343 vec: [0:1.000, 6:1.847, 15:2.253, 16:2.253]
Key: 0: Value: wt: 1.0distance: 4.766995846807366 vec: [0:1.000, 6:1.847, 17:1.847]
Key: 0: Value: wt: 1.0distance: 7.129458704934154 vec: [0:1.000, 10:2.253, 17:1.847]
Key: 5: Value: wt: 1.0distance: 0.0 vec: [0:1.000, 3:1.847, 7:1.847, 9:1.847, 13:1.847]
Key: 6: Value: wt: 1.0distance: 0.0 vec: [1:2.253, 3:1.847, 7:1.847, 9:1.847, 13:1.847]

Thanks,
Weide

On Thu, Oct 6, 2011 at 11:54 AM, Grant Ingersoll <[email protected]> wrote:

> On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:
>
> > Hi,
> >
> > i have used mahout to produce kmeans clustering for my tf-idf result. I use the mahout command line to produce the clusters and it seems it successfully completes.
> >
> > $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000
> >
> > It seems there are two clusters directory generated (cluster-1 and cluster-2), when i use clusterdump on each of them, it seems to me that the clustered top terms are the same. Any idea why?
>
> The top terms are exactly that, the top terms. It is not all of the terms. My guess is that things don't change much between the two iterations.
>
> > Also, how can i see which documents have been assigned to each cluster. Right now, i can see the number of documents assigned but not the complete list.
>
> Add the --clustering flag. By default, K-Means just calculates the centroids. If you want to know membership, the --clustering flag does that.
>
> > Most importantly, for production purposes, i assume it makes sense for kmeans always runs on hadoop to generate the clustering file. But how do i consume these during serving? Ideally, serving should have the doc id or query passed as a query, and the server should return the top document ranked by the score within the same cluster back. How do I do it in code? Any good examples?
>
> Presumably, you have to load up the centroids and/or the results and see which cluster the new item belongs to.
>
> > Thanks a lot,
> >
> > Weide
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
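
On reading the dump at the top of this message: the key class is IntWritable and each value carries wt, distance and vec fields, so a plausible reading is that the key is the cluster id, wt is the point's weight, distance is the point's distance to that cluster's centroid, and vec is the sparse tf-idf vector as index:value pairs. The sketch below is plain Python, not Mahout code. It parses text in exactly the format shown above and groups the vectors by key. The names `RECORD` and `parse_dump` are hypothetical and introduced only for illustration. Note that nothing in these rows names the source document, so mapping a row back to a doc would need vectors that carry an identifier, which this sketch does not attempt.

```python
import re
from collections import defaultdict

# Matches one record in the dump format shown above. Note that the dump
# prints "wt: 1.0distance:" with no space between the weight and "distance".
RECORD = re.compile(
    r"Key: (\d+): Value: wt: ([0-9.]+)\s*distance: ([0-9.eE+-]+)\s+vec: \[(.*?)\]"
)

def parse_dump(text):
    """Return a list of (cluster_id, weight, distance, vector) tuples.

    The vector is a dict mapping term index to tf-idf value.
    """
    records = []
    for match in RECORD.finditer(text):
        cluster_id, weight, distance, body = match.groups()
        vec = {}
        for item in body.split(","):
            item = item.strip()
            if not item:  # empty vector "[]"
                continue
            index, value = item.split(":")
            vec[int(index)] = float(value)
        records.append((int(cluster_id), float(weight), float(distance), vec))
    return records

# Two rows copied from the dump above.
sample = (
    "Key: 0: Value: wt: 1.0distance: 4.766995846807366 "
    "vec: [0:1.000, 6:1.847, 17:1.847] "
    "Key: 5: Value: wt: 1.0distance: 0.0 "
    "vec: [0:1.000, 3:1.847, 7:1.847, 9:1.847, 13:1.847]"
)

records = parse_dump(sample)

# Group member vectors by the key (assumed to be the cluster id).
by_cluster = defaultdict(list)
for cluster_id, weight, distance, vec in records:
    by_cluster[cluster_id].append(vec)
```

Under that reading, the seven rows in the dump would put five points in cluster 0 and one point each in clusters 5 and 6.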

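
Grant's suggestion of loading the centroids and seeing which cluster a new item belongs to can be sketched as a nearest-centroid lookup. This is a minimal Python illustration, not Mahout's API. It assumes plain Euclidean distance, which may not be the measure your kmeans run used, so swap in the matching one. The helper names `euclidean` and `nearest_cluster` are hypothetical, and the stand-in centroids are the two vectors from the dump whose distance is 0.0 (a point at distance 0.0 coincides with its centroid). In a real setup the centroids would be read from the final clusters directory.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as {index: value}."""
    indices = set(a) | set(b)
    return math.sqrt(sum((a.get(i, 0.0) - b.get(i, 0.0)) ** 2 for i in indices))

def nearest_cluster(vec, centroids):
    """Return (cluster_id, distance) for the centroid closest to vec."""
    return min(
        ((cluster_id, euclidean(vec, centroid))
         for cluster_id, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )

# Stand-in centroids: the two distance-0.0 vectors from the dump above.
centroids = {
    5: {0: 1.0, 3: 1.847, 7: 1.847, 9: 1.847, 13: 1.847},
    6: {1: 2.253, 3: 1.847, 7: 1.847, 9: 1.847, 13: 1.847},
}

# Hypothetical tf-idf vector for a new document or query.
new_doc = {0: 1.0, 3: 1.847, 7: 1.847}

cluster_id, distance = nearest_cluster(new_doc, centroids)
```

Once the cluster id is known, a serving layer could restrict ranking to the documents already assigned to that cluster, which is the membership list the --clustering output provides.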