Re: question about clustering

Walter Chang Sat, 08 Oct 2011 09:28:42 -0700

Hi Grant,

I added the clustering flag for kmeans. Now I see an output dir called
ClusteredPoints. However, when I dump it with the sequence file dumper, what
does each column mean? How do I associate each row with the original doc? The
output is not straightforward to me.


Input Path: part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class
org.apache.mahout.clustering.WeightedPropertyVectorWritable
Key: 0: Value: wt: 1.0distance: 7.811960616283556  vec: [0:1.000, 11:1.847,
12:2.253, 14:1.847]
Key: 0: Value: wt: 1.0distance: 10.856925385759745  vec: [0:1.000, 5:2.253,
8:2.253, 11:1.847, 14:1.847]
Key: 0: Value: wt: 1.0distance: 10.174423474410343  vec: [0:1.000, 6:1.847,
15:2.253, 16:2.253]
Key: 0: Value: wt: 1.0distance: 4.766995846807366  vec: [0:1.000, 6:1.847,
17:1.847]
Key: 0: Value: wt: 1.0distance: 7.129458704934154  vec: [0:1.000, 10:2.253,
17:1.847]
Key: 5: Value: wt: 1.0distance: 0.0  vec: [0:1.000, 3:1.847, 7:1.847,
9:1.847, 13:1.847]
Key: 6: Value: wt: 1.0distance: 0.0  vec: [1:2.253, 3:1.847, 7:1.847,
9:1.847, 13:1.847]

Thanks,

Weide
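
A note on reading the dump above, hedged because it is based on my
understanding of the clusteredPoints format rather than on anything stated in
this thread: the Key appears to be the id of the cluster the point was
assigned to, wt the membership weight, distance the distance from the point to
that cluster's centroid, and vec the point's sparse tf-idf vector
(index:value). The vectors shown carry no document name, so mapping a row back
to a document likely requires named vectors at vectorization time (I believe
seq2sparse has an option for this, but treat that as an assumption). The
sketch below is plain Python, not a Mahout API: it groups records from the
dumped text by cluster id. The names RECORD and parse_dump are hypothetical.

```python
import re
from collections import defaultdict

# One record as printed in the dump above. Note there is no space between
# the weight and the word "distance", and the mail client has wrapped the
# vector across lines, so the pattern must match across newlines (re.S).
RECORD = re.compile(
    r"Key:\s*(\d+):\s*Value:\s*wt:\s*([0-9.]+)\s*"
    r"distance:\s*([0-9.eE+-]+)\s*vec:\s*\[(.*?)\]",
    re.S,
)

def parse_dump(text):
    """Return {cluster_id: [(weight, distance, {index: value}), ...]}."""
    clusters = defaultdict(list)
    for m in RECORD.finditer(text):
        cluster_id = int(m.group(1))
        weight = float(m.group(2))
        distance = float(m.group(3))
        vec = {}
        for entry in m.group(4).split(","):
            entry = entry.strip()
            if entry:
                idx, val = entry.split(":")
                vec[int(idx)] = float(val)
        clusters[cluster_id].append((weight, distance, vec))
    return dict(clusters)

# Two records copied from the dump above.
SAMPLE = """Key: 0: Value: wt: 1.0distance: 4.766995846807366  vec: [0:1.000, 6:1.847,
17:1.847]
Key: 5: Value: wt: 1.0distance: 0.0  vec: [0:1.000, 3:1.847, 7:1.847,
9:1.847, 13:1.847]
"""

if __name__ == "__main__":
    for cid, points in sorted(parse_dump(SAMPLE).items()):
        print(cid, len(points), [round(d, 3) for _, d, _ in points])
```

This only handles the unnamed "[index:value, ...]" layout shown in this
message; a dump of named vectors would need a different pattern.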


On Thu, Oct 6, 2011 at 11:54 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Oct 2, 2011, at 11:52 PM, Walter Chang wrote:
>
> > Hi,
> >
> > I have used Mahout to produce a kmeans clustering of my tf-idf result. I
> > ran it from the mahout command line and it appears to complete
> > successfully.
> >
> > $MAHOUT_HOME/bin/mahout kmeans -i ./tfidf-vectors -c ./initialclusters -o ./kmeans-clusters -cd 1.0 -k 3 -x 1000
> >
> > It seems two cluster directories were generated (cluster-1 and
> > cluster-2). When I use clusterdump on each of them, the top terms of the
> > clusters look the same. Any idea why?
>
> The top terms are exactly that, the top terms.  It is not all of the terms.
> My guess is that things don't change much between the two iterations.
>
> >
> > Also, how can I see which documents have been assigned to each cluster?
> > Right now, I can see the number of documents assigned but not the
> > complete list.
>
> Add the --clustering flag.  By default, K-Means just calculates the
> centroids.  If you want to know membership, the --clustering flag does that.
>
> >
> > Most importantly, for production purposes, I assume it makes sense for
> > kmeans to always run on Hadoop to generate the clustering file. But how do
> > I consume the results during serving? Ideally, the server would take a doc
> > id or a query as input and return the top documents from the same cluster,
> > ranked by score. How do I do that in code? Any good examples?
>
> Presumably, you have to load up the centroids and/or the results and see
> which cluster the new item belongs to.
>
>
>
> >
> > Thanks a lot,
> >
> > Weide
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>
>
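
Grant's suggestion above (load the centroids and see which cluster a new item
belongs to) can be sketched without any Mahout code. The example below is a
minimal illustration in plain Python, assuming the centroids have already been
exported as sparse index-to-weight maps. The centroid values, the helper names
and the choice of Euclidean distance are all assumptions made for the example;
in practice the distance should match whatever measure the kmeans run used.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def nearest_cluster(vec, centroids):
    """Return the id of the centroid closest to vec."""
    return min(centroids, key=lambda cid: euclidean(vec, centroids[cid]))

# Hypothetical centroids keyed by cluster id (not taken from a real run).
centroids = {
    0: {0: 1.0, 6: 1.2, 17: 1.1},
    5: {0: 1.0, 3: 1.847, 7: 1.847},
}

# tf-idf vector of a new document, built with the same dictionary as training.
new_doc = {0: 1.0, 6: 1.847, 17: 1.847}

if __name__ == "__main__":
    print(nearest_cluster(new_doc, centroids))
```

Once the cluster id is known, the documents already assigned to that cluster
(from the clustered points output) can be ranked by whatever score the serving
layer uses.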
