Re: Kmeans clusterdump Interpretation

Andrew Musselman Mon, 20 Jul 2015 19:03:57 -0700

You could probably write a post-processing step that finds the point closest
to the centroid, based on the "distance" property that clusterdump prints for
each point, if I'm recalling it correctly.
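
A minimal sketch of such a post-processing step, in Python, might look like
the following. It assumes the clusterdump output was saved as text and that
point lines look like the ones quoted below ("1.0 : [distance=...]: [...]"),
with each cluster's points preceded by the "Weight : [props - optional]:
Point:" header. The names closest_points, POINT_LINE and POINTS_HEADER are
made up for illustration and are not part of Mahout.

```python
import re

# Assumed shape of a point line, copied from the output quoted in this thread:
#   1.0 : [distance=0.3963903651622338]: [...]
POINT_LINE = re.compile(
    r"^\s*(?P<weight>\d+(?:\.\d+)?)\s*:\s*"
    r"\[[^\]]*?distance=(?P<distance>[-+0-9.eE]+)[^\]]*\]\s*:\s*"
    r"(?P<point>.*)$"
)
# Assumed header that precedes each cluster's list of points.
POINTS_HEADER = re.compile(r"^\s*Weight\s*:.*Point:\s*$")


def closest_points(lines):
    """Return one (distance, point_text) pair per cluster: the listed point
    with the smallest reported distance to that cluster's centroid."""
    best_per_cluster = []
    for line in lines:
        if POINTS_HEADER.match(line):
            best_per_cluster.append(None)  # start of a new cluster's points
            continue
        m = POINT_LINE.match(line)
        if not m or not best_per_cluster:
            continue  # not a point line, or no cluster header seen yet
        candidate = (float(m.group("distance")), m.group("point").strip())
        current = best_per_cluster[-1]
        if current is None or candidate[0] < current[0]:
            best_per_cluster[-1] = candidate
    return [best for best in best_per_cluster if best is not None]
```

You would call it with the lines of the saved dump, e.g.
closest_points(open(path)) where path is wherever you wrote the output. Note
that this picks the data point nearest to the computed centroid, which is not
necessarily the same thing as a true medoid.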


On Mon, Jul 20, 2015 at 6:45 PM, Ankit Goel wrote:

> That kind of puts me in a tough position. I was planning to use kmeans as a
> method for aggregating similar articles from multiple news sources, and
> then getting a representative article from those. Here I mean similar as in
> the articles are from different news sources but are about the exact same
> thing. Intuitively it seems that these articles would get grouped
> together. Any suggestions how I should go about that? So far I'm using
> nutch to crawl, solr to index and now I'm here on mahout.
>
> On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning wrote:
>
> > The most central point in a cluster is often referred to as a medoid
> > (similar to median, but multi-dimensional).
> >
> > The Mahout code does not compute medoids.  In general, they are difficult
> > to compute and implementing a full k-medoid clustering algorithm even
> > more so.
> >
> >
> >
> > On Mon, Jul 20, 2015 at 6:25 PM, Ankit Goel wrote:
> >
> > > Oh, I thought kmeans gave me a point vector as a centroid, not a
> > > calculated point central to a cluster. I guess in this case I would be
> > > looking for the most central point vector (from the index) that I can
> > > use as a representative of the cluster.
> > >
> > > On Tue, Jul 21, 2015 at 6:41 AM, Andrew Musselman wrote:
> > >
> > > > I'm not sure centroid id is even a defined thing, especially since the
> > > > centroid, in my understanding, is just a point in space, not
> > > > necessarily a point in your data.
> > > >
> > > > Are you trying to find the most-central point in a given cluster?
> > > >
> > > > On Mon, Jul 20, 2015 at 5:18 PM, Ankit Goel wrote:
> > > >
> > > > > Hi,
> > > > > I've been messing with mahout 0.10 and kmeans clustering with a solr
> > > > > 4.6.1 index. The data is news articles. The --field option for kmeans
> > > > > is set to "content". The idField is set to "title" (just so I can
> > > > > analyse it faster). The clusterdump of the kmeans result gives me a
> > > > > proper output, but I can't figure out the id of the vector chosen as
> > > > > the center. There are only 14-15 articles, so I am not hung up about
> > > > > the cluster performance at this time.
> > > > >
> > > > > I used random seeds for the kmeans commandline.
> > > > > For reference, this is the clusterdump command line I am executing:
> > > > >
> > > > > bin/mahout clusterdump -i $MAHOUT_HOME/testCluster/clusters-3-final
> > > > > -p $MAHOUT_HOME/testCluster/clusteredPoints -d $MAHOUT_HOME/dict.txt
> > > > > -b 5
> > > > >
> > > > > The output I get is of the form
> > > > >
> > > > > :{"r":
> > > > >
> > > > > top terms
> > > > >
> > > > > xxxxx==>xxxxx
> > > > >
> > > > > Weight : [props - optional]:  Point:
> > > > >
> > > > >  1.0 : [distance=0.0]: [{"account":0.026}.......other features]
> > > > >
> > > > > 1.0 : [distance=0.3963903651622338]: [....]
> > > > >
> > > > >
> > > > > So how exactly do I get the centroid id? I have even tried accessing
> > > > > it with Java
> > > > >
> > > > > ClusterWritable value.getValue().getCenter() but this just gives me
> > > > > the features and values of the centroid.
> > > > >
> > > > > Also, please do explain the meaning of "account":0.026 (just making
> > > > > sure I know it right). I used tfidf.
> > > > >
> > > > > --
> > > > > Regards,
> > > > > Ankit Goel
> > > > > http://about.me/ankitgoel
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Regards,
> > > Ankit Goel
> > > http://about.me/ankitgoel
> > >
> >
>
>
>
> --
> Regards,
> Ankit Goel
> http://about.me/ankitgoel
>
