Re: Clustering algorithms

Mike Hugo Tue, 17 Sep 2013 13:13:08 -0700

Thanks Ted!


On Tue, Sep 17, 2013 at 2:59 PM, Ted Dunning <[email protected]> wrote:

> Right now the best in terms of speed without losing quality in Mahout is
> the streaming k-means implementation.
>
> One exciting possibility is that you probably can combine a streaming
> k-means pre-pass with a regularized k-means algorithm in order to get
> results more like Lingo.  You could also follow with a DP-means pass to get
> an idea of the optimal number of clusters.
>
> The idea with streaming k-means is that a first pass does a rough
> clustering into a whole lot of clusters.  This pass is fast because only
> approximate search is needed.  It is also adaptive so you only have to
> specify very roughly how many clusters you will probably be interested in
> having later.  The output is an approximate k-means clustering with many
> more clusters than you asked for.  This output can then be clustered in
> memory using any weighted clustering algorithm you care to use.  For
> k-means and certain kinds of data, you can even get nice probabilistic
> accuracy bounds for the combo.
>
>
>
> On Tue, Sep 17, 2013 at 12:06 PM, Mike Hugo <[email protected]> wrote:
>
> > Hello,
> >
> > I'm new to Mahout but have been working with Solr, Carrot2 and clustering
> > documents with the Lingo algorithm.  This has worked well for us for
> > clustering small sets of search results, but we are now branching out
> > into wanting to cluster larger sets of documents (millions to tens of
> > millions of documents for now).
> >
> > Could someone point me in the right direction as to which of the
> > clustering algorithms I should take a look at first (that would be
> > similar to Lingo)?
> >
> > Thanks,
> >
> > Mike
> >
>
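
For anyone reading this thread later, the two-pass idea described in the
quoted reply can be sketched roughly as below. This is not Mahout's
implementation. The function names (absorb, streaming_pass,
weighted_kmeans), the distance-over-cutoff probability, and the
cutoff-doubling rule are all assumptions made for illustration, and the
nearest-centroid search is an exact scan rather than the approximate search
the reply mentions.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def absorb(sketch, point, weight, cutoff, rng):
    """Merge a weighted point into its nearest centroid, or start a new one."""
    if sketch:
        # Exact scan for clarity; a fast version would use approximate search.
        i = min(range(len(sketch)), key=lambda j: sq_dist(sketch[j][0], point))
        centroid, w = sketch[i]
        # Assumed rule: the farther the point, the likelier a new cluster.
        if rng.random() >= sq_dist(centroid, point) / cutoff:
            total = w + weight
            merged = [(c * w + p * weight) / total
                      for c, p in zip(centroid, point)]
            sketch[i] = (merged, total)
            return
    sketch.append((list(point), weight))


def streaming_pass(points, max_sketch, cutoff=1e-3, seed=0):
    """First pass: one sweep over the data, yielding (centroid, weight) pairs."""
    rng = random.Random(seed)
    sketch = []
    for p in points:
        absorb(sketch, p, 1.0, cutoff, rng)
        while len(sketch) > max_sketch:
            # Adaptive step: loosen the cutoff and re-cluster the centroids.
            cutoff *= 2.0
            old, sketch = sketch, []
            for centroid, w in old:
                absorb(sketch, centroid, w, cutoff, rng)
    return sketch


def weighted_kmeans(sketch, k, iterations=20, seed=0):
    """Second pass: in-memory Lloyd iterations over the weighted sketch."""
    rng = random.Random(seed)
    centers = [list(c) for c, _ in rng.sample(sketch, k)]
    dims = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dims for _ in range(k)]
        totals = [0.0] * k
        for centroid, w in sketch:
            i = min(range(k), key=lambda j: sq_dist(centers[j], centroid))
            totals[i] += w
            for d, x in enumerate(centroid):
                sums[i][d] += w * x
        # Keep the previous center if a cluster received no weight.
        centers = [[s / t for s in row] if t > 0 else prev
                   for row, t, prev in zip(sums, totals, centers)]
    return centers


if __name__ == "__main__":
    rng = random.Random(42)
    data = [(rng.gauss(c, 1.0), rng.gauss(c, 1.0))
            for c in (0.0, 1000.0) for _ in range(100)]
    rng.shuffle(data)
    sketch = streaming_pass(data, max_sketch=20)
    print(len(sketch), "sketch centroids")
    print(weighted_kmeans(sketch, k=2))
```

The sketch is deliberately allowed to hold more centroids than the final k,
and each one carries the number of points it absorbed. That weight is what
lets the second pass be swapped for any weighted clustering algorithm, as
the reply suggests.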
