Have a look at the ClusterDumper. Running bin/mahout clusterdump --help should give you an idea of how to run it.
The main output contains the centroids. The clustered points dir contains all of the original points, along with the cluster each one belongs to and its distance to that cluster's centroid. In the dump you pasted, the key is the cluster id, and the value holds the point's weight (wt), its distance, and the vector itself. The ClusterDumper can marry these two outputs.

On Oct 8, 2011, at 12:28 PM, Walter Chang wrote:

> Hi Grant,
>
> I added clustering flag for kmeans. Now i see the an output dir called
> ClusteredPoints. However, when i use sequential file dump, what does each
> column mean ? How do i associate with the original doc ? The output is not
> straightforward to me.
>
> Input Path: part-m-00000
> Key class: class org.apache.hadoop.io.IntWritable Value Class: class
> org.apache.mahout.clustering.WeightedPropertyVectorWritable
> Key: 0: Value: wt: 1.0distance: 7.811960616283556 vec: [0:1.000, 11:1.847,
> 12:2.253, 14:1.847]
> Key: 0: Value: wt: 1.0distance: 10.856925385759745 vec: [0:1.000, 5:2.253,
> 8:2.253, 11:1.847, 14:1.847]
> Key: 0: Value: wt: 1.0distance: 10.174423474410343 vec: [0:1.000, 6:1.847,
> 15:2.253, 16:2.253]
> Key: 0: Value: wt: 1.0distance: 4.766995846807366 vec: [0:1.000, 6:1.847,
> 17:1.847]
> Key: 0: Value: wt: 1.0distance: 7.129458704934154 vec: [0:1.000, 10:2.253,
> 17:1.847]
> Key: 5: Value: wt: 1.0distance: 0.0 vec: [0:1.000, 3:1.847, 7:1.847,
> 9:1.847, 13:1.847]
> Key: 6: Value: wt: 1.0distance: 0.0 vec: [1:2.253, 3:1.847, 7:1.847,
> 9:1.847, 13:1.847]
>
> Thanks,
>
> Weide
>
>
> On Thu, Oct 6, 2011 at 11:54 AM, Grant Ingersoll <[email protected]>wrote:
>
>>
>> On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:
>>
>>> Hi ,
>>>
>>> i have used mahout to produce kmeans clustering for my tf-idf result. I use
>>> the mahout command line to produce the clusters and it seems it successfully
>>> completes.
>>>
>>> $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o
>>> ./kmeans-clusters -cd 1.0 -k 3 -x 1000
>>>
>>> It seems there are two clusters directory generated.(cluster-1 and
>>> cluster-2) , when i use clusterdump on each of them, it seems to me that
>>> the clustered top terms are the same. Any idea why ?
>>
>> The top terms are exactly that, the top terms. It is not all of the terms.
>> My guess is that things don't change much between the two iterations.
>>
>>>
>>> Also, how can i see which documents have been assigned to each cluster.
>>> Right now, i can see the number of documents assigned but not the complete
>>> list.
>>
>> Add the --clustering flag. By default, K-Means just calculates the
>> centroids. If you want to know membership, the --clustering flag does that.
>>
>>>
>>> Most importantly, for production purposes, i assume it makes sense for
>>> kmeans always runs on hadoop to generate the clustering file. But how do i
>>> consume these during serving ? Ideally, serving should have the doc id or
>>> query passed as a query, and the server should return the top document
>>> ranked by the score within the same cluster back. How do I do it in code ?
>>> Any good examples ?
>>
>> Presumably, you have to load up the centroids and/or the results and see
>> which cluster the new item belongs to.
>>
>>
>>
>>>
>>> Thanks a lot,
>>>
>>> Weide
>>
>> --------------------------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com
>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>
>>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
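If you only have the text dump of the clustered points to hand, a rough way to see membership is to parse it yourself. The following is a minimal sketch in Python, not anything Mahout ships. It assumes the record layout matches the lines pasted in this thread (including the missing space in "1.0distance:"), and the names RECORD, parse_vec and group_by_cluster are made up for illustration.

```python
import re
from collections import defaultdict

# A decimal number, optionally with an exponent (Java may print 1.0E-5).
NUM = r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"

# One record of the dump, as pasted in this thread (assumed layout):
# "Key: 0: Value: wt: 1.0distance: 7.81 vec: [0:1.000, 11:1.847]"
RECORD = re.compile(
    r"Key:\s*(?P<cluster>\d+):\s*Value:\s*wt:\s*(?P<wt>" + NUM + r")\s*"
    r"distance:\s*(?P<dist>" + NUM + r")\s*vec:\s*\[(?P<vec>[^\]]*)\]"
)


def parse_vec(body):
    """Turn '0:1.000, 11:1.847' into {0: 1.0, 11: 1.847}."""
    vec = {}
    for item in body.split(","):
        item = item.strip()
        if item:
            idx, val = item.split(":")
            vec[int(idx)] = float(val)
    return vec


def group_by_cluster(dump_text):
    """Map cluster id -> list of (distance, sparse vector) pairs."""
    # Records may wrap across lines, so collapse all whitespace first.
    flat = " ".join(dump_text.split())
    clusters = defaultdict(list)
    for m in RECORD.finditer(flat):
        point = (float(m.group("dist")), parse_vec(m.group("vec")))
        clusters[int(m.group("cluster"))].append(point)
    return dict(clusters)


# The seven records from this thread, without the "> " quote markers.
SAMPLE_DUMP = """
Key: 0: Value: wt: 1.0distance: 7.811960616283556 vec: [0:1.000, 11:1.847,
12:2.253, 14:1.847]
Key: 0: Value: wt: 1.0distance: 10.856925385759745 vec: [0:1.000, 5:2.253,
8:2.253, 11:1.847, 14:1.847]
Key: 0: Value: wt: 1.0distance: 10.174423474410343 vec: [0:1.000, 6:1.847,
15:2.253, 16:2.253]
Key: 0: Value: wt: 1.0distance: 4.766995846807366 vec: [0:1.000, 6:1.847,
17:1.847]
Key: 0: Value: wt: 1.0distance: 7.129458704934154 vec: [0:1.000, 10:2.253,
17:1.847]
Key: 5: Value: wt: 1.0distance: 0.0 vec: [0:1.000, 3:1.847, 7:1.847,
9:1.847, 13:1.847]
Key: 6: Value: wt: 1.0distance: 0.0 vec: [1:2.253, 3:1.847, 7:1.847,
9:1.847, 13:1.847]
"""

clusters = group_by_cluster(SAMPLE_DUMP)
for cluster_id in sorted(clusters):
    print(cluster_id, len(clusters[cluster_id]), "points")
```

The vectors in this dump carry no document names, so this only groups points by cluster id. It does not tell you which original document each vector came from.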

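On the serving question, the suggestion in the quoted reply (load up the centroids and see which cluster the new item belongs to) can be sketched roughly as below. This is plain Python and does not use Mahout's API. Plain Euclidean distance is an assumption; use whichever measure the clusters were built with. The centroid for cluster 0 is a made-up value for illustration, while clusters 5 and 6 reuse the single points shown at distance 0.0 in the dump. The names euclidean and nearest_cluster are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def nearest_cluster(vec, centroids):
    """Return (cluster id, distance) for the centroid closest to vec."""
    scored = ((cid, euclidean(vec, c)) for cid, c in centroids.items())
    return min(scored, key=lambda pair: pair[1])


# Centroids as sparse {term index: weight} dicts.
centroids = {
    0: {0: 1.0, 6: 1.0, 17: 1.5},  # hypothetical values, illustration only
    5: {0: 1.0, 3: 1.847, 7: 1.847, 9: 1.847, 13: 1.847},
    6: {1: 2.253, 3: 1.847, 7: 1.847, 9: 1.847, 13: 1.847},
}

# A new document, vectorized with the same dictionary and weighting
# as the training set (assumed to be done elsewhere).
new_doc = {0: 1.0, 3: 1.847, 7: 1.847, 9: 1.847}

cluster_id, distance = nearest_cluster(new_doc, centroids)
print(cluster_id, distance)
```

Once the cluster is known, the members of that cluster from the clustered points output can be ranked against the query with the same distance function.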