You can always just pick the article closest to the centroid. But I think you may find that, with simple k-means, the clusters are going to be about more than one thing.
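A rough sketch of that post-processing step, in case it helps. It assumes you have already pulled the centroid and the cluster's member vectors out as plain double arrays (that extraction is not shown), and that Euclidean distance is what you want. The class and method names are made up for illustration; none of this is Mahout API.

```java
import java.util.Arrays;
import java.util.List;

public class ClosestToCentroid {

    /** Euclidean distance between two dense vectors of equal length. */
    public static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the member vector closest to the centroid, or -1 if there are no members. */
    public static int closestIndex(double[] centroid, List<double[]> members) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < members.size(); i++) {
            double d = euclidean(centroid, members.get(i));
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: three "articles" in a two-term vector space.
        String[] titles = {"Article A", "Article B", "Article C"};
        List<double[]> members = Arrays.asList(
                new double[] {0.0, 1.0},
                new double[] {0.4, 0.5},
                new double[] {1.0, 0.0});
        // The centroid is a computed point, not necessarily one of the articles.
        double[] centroid = {0.5, 0.5};

        int i = closestIndex(centroid, members);
        System.out.println("Representative: " + titles[i]);
    }
}
```

Note that this picks the point nearest the computed mean, which is not quite the same thing as a true medoid (the member with the smallest total distance to all other members), though for tight clusters the two will often coincide.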
On Mon, Jul 20, 2015 at 8:21 PM, Ankit Goel <[email protected]> wrote:

> Hmm, kmeans algorithmically is supposed to only anoint existing
> vectors (documents) as the centroid for a cluster every step (or so I
> believe). If mahout is generating a non-document vector as a centroid,
> it changes a lot of things.
>
> That would also explain the -distanceMeasure option in clusterdump. As
> Andrew mentions, running clusterdump with the default euclidean measure
> should give me the closest document vector to the calculated centroid.
> Please correct me if I'm wrong anywhere.
> Thanks
>
> On Tue, Jul 21, 2015 at 7:33 AM, Andrew Musselman <[email protected]> wrote:
>
> > It's possible you could write a post-processing step to find the
> > closest point to the centroid based on the "distance" property, if
> > I'm recalling it correctly.
> >
> > On Mon, Jul 20, 2015 at 6:45 PM, Ankit Goel <[email protected]> wrote:
> >
> > > That kind of puts me in a tough position. I was planning to use
> > > kmeans as a method for aggregating similar articles from multiple
> > > news sources, and then getting a representative article from those.
> > > Here I mean similar as in the articles are from different news
> > > sources but are about the exact same thing. Intuitively it seems
> > > that these articles would get grouped together. Any suggestions how
> > > I should go about that? So far I'm using nutch to crawl, solr to
> > > index and now I'm here on mahout.
> > >
> > > On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning <[email protected]> wrote:
> > >
> > > > The most central point in a cluster is often referred to as a
> > > > medoid (similar to median, but multi-dimensional).
> > > >
> > > > The Mahout code does not compute medoids. In general, they are
> > > > difficult to compute and implementing a full k-medoid clustering
> > > > algorithm even more so.
> > > >
> > > > On Mon, Jul 20, 2015 at 6:25 PM, Ankit Goel <[email protected]> wrote:
> > > >
> > > > > Oh, I thought kmeans gave me a point vector as a centroid, not a
> > > > > calculated point central to a cluster. I guess in this case I
> > > > > would be looking for the most central point vector (from the
> > > > > index) that I can use as a representative of the cluster.
> > > > >
> > > > > On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman <[email protected]> wrote:
> > > > >
> > > > > > I'm not sure centroid id is even a defined thing, especially
> > > > > > since the centroid, in my understanding, is just a point in
> > > > > > space, not necessarily a point in your data.
> > > > > >
> > > > > > Are you trying to find the most-central point in a given
> > > > > > cluster?
> > > > > >
> > > > > > On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel <[email protected]> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > > I've been messing with mahout 0.10 and kmeans clustering
> > > > > > > with a solr 4.6.1 index. The data is news articles. The
> > > > > > > --field option for kmeans is set to "content". The idField
> > > > > > > is set to "title" (just so I can analyse it faster). The
> > > > > > > clusterdump of the kmeans result gives me a proper output,
> > > > > > > but I can't figure out the id of the vector chosen as the
> > > > > > > center. There are only 14-15 articles so I am not hung up
> > > > > > > about the cluster performance at this time.
> > > > > > >
> > > > > > > I used random seeds for the kmeans commandline.
> > > > > > > For reference, this is the commandline cluster dump I am
> > > > > > > executing
> > > > > > >
> > > > > > > bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
> > > > > > > -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt -b 5
> > > > > > >
> > > > > > > The output I get is of the form
> > > > > > >
> > > > > > > :{"r":
> > > > > > >
> > > > > > > top terms
> > > > > > >
> > > > > > > xxxxx==>xxxxx
> > > > > > >
> > > > > > > Weight : [props - optional]: Point:
> > > > > > >
> > > > > > > 1.0 : [distance=0.0]: [{"account":0.026}.......other features]
> > > > > > >
> > > > > > > 1.0 : [distance=0.3963903651622338]: [....]
> > > > > > >
> > > > > > > So how exactly do I get the centroid id? I have even tried
> > > > > > > accessing it with java
> > > > > > >
> > > > > > > ClusterWritable value.getValue().getCenter() but this just
> > > > > > > gives me the features and values of the centroid.
> > > > > > >
> > > > > > > Also, please do explain the meaning of "account":0.026 (just
> > > > > > > making sure I know it right). I used tfidf.
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > > Ankit Goel
> > > > > > > http://about.me/ankitgoel
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Ankit Goel
> > > > > http://about.me/ankitgoel
> > >
> > > --
> > > Regards,
> > > Ankit Goel
> > > http://about.me/ankitgoel
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel
