Re: Clustering algorithms

Ted Dunning Tue, 17 Sep 2013 13:27:55 -0700

Right now the best option in Mahout for speed without losing quality is
the streaming k-means implementation.

One exciting possibility is that you can probably combine a streaming
k-means pre-pass with a regularized k-means algorithm in order to get
results more like Lingo.  You could also follow with a DP-means pass to get
an idea of the optimal number of clusters.
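
To make the DP-means idea concrete, here is a minimal sketch in plain
Python.  It is not Mahout code: the function name dp_means and the
parameter lam are made up for illustration, and it assumes squared
Euclidean distance on small in-memory data.  A point whose squared distance
to its nearest centroid exceeds lam opens a new cluster, so the number of
clusters falls out of the penalty instead of being fixed up front.

```python
def dp_means(points, lam, max_iter=100):
    """Hypothetical DP-means sketch.

    points: list of equal-length tuples.
    lam: squared-distance penalty for opening a new cluster (assumed name).
    Returns (centroids, assignment).
    """
    dim = len(points[0])
    n = len(points)
    # Start with a single centroid at the global mean.
    centroids = [tuple(sum(p[d] for p in points) / n for d in range(dim))]
    prev = None
    for _ in range(max_iter):
        assign = []
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            j = min(range(len(centroids)), key=dists.__getitem__)
            if dists[j] > lam:
                # Too far from every centroid: open a new cluster at p.
                centroids.append(tuple(p))
                j = len(centroids) - 1
            assign.append(j)
        # Recompute means, dropping clusters that ended up empty.
        groups = {}
        for i, j in enumerate(assign):
            groups.setdefault(j, []).append(points[i])
        order = sorted(groups)
        remap = {old: new for new, old in enumerate(order)}
        centroids = [
            tuple(sum(p[d] for p in groups[j]) / len(groups[j]) for d in range(dim))
            for j in order
        ]
        assign = [remap[j] for j in assign]
        if assign == prev:
            break  # assignments stopped changing
        prev = assign
    return centroids, assign
```

A small lam gives many clusters and a large lam gives few, so in practice
you would probably try several values on the output of the pre-pass and
see where the cluster count stabilizes.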

The idea with streaming k-means is that a first pass does a rough
clustering into a whole lot of clusters.  This pass is fast because only
approximate search is needed.  It is also adaptive so you only have to
specify very roughly how many clusters you will probably be interested in
having later.  The output is an approximate k-means clustering with many
more clusters than you asked for.  This output can then be clustered in
memory using any weighted clustering algorithm you care to use.  For
k-means and certain kinds of data, you can even get nice probabilistic
accuracy bounds for the combo.
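
As a rough illustration of the two stages, here is a simplified sketch in
plain Python.  It is not the Mahout implementation: all names
(streaming_kmeans, max_sketch, cutoff, beta, weighted_kmeans) are
hypothetical, the nearest-centroid search is exact rather than approximate,
and the rule for opening a new centroid is an assumed simplification.  The
first function builds a weighted sketch in one pass, and the second runs an
ordinary weighted k-means over that sketch in memory.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centers, p):
    return min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))

def absorb(sketch, p, w, cutoff, rng):
    """Merge (p, w) into its nearest sketch centroid, or open a new one."""
    if sketch:
        j = nearest([c for c, _ in sketch], p)
        c, cw = sketch[j]
        d = sq_dist(p, c)
        # Assumed rule: far (or heavy) points are more likely to start a cluster.
        if rng.random() >= min(1.0, w * d / cutoff):
            total = cw + w
            merged = [(cw * a + w * b) / total for a, b in zip(c, p)]
            sketch[j] = (merged, total)
            return
    sketch.append((list(p), w))

def streaming_kmeans(points, max_sketch, cutoff=1e-3, beta=1.5, seed=0):
    """Single pass over points; returns a list of (centroid, weight) pairs."""
    rng = random.Random(seed)
    sketch = []
    for p in points:
        absorb(sketch, p, 1.0, cutoff, rng)
        # Adaptive part: when the sketch grows too large, loosen the cutoff
        # and re-cluster the sketch centroids themselves.
        while len(sketch) > max_sketch:
            cutoff *= beta
            old, sketch = sketch, []
            for c, w in old:
                absorb(sketch, c, w, cutoff, rng)
    return sketch

def weighted_kmeans(sketch, k, iters=20):
    """Plain Lloyd iterations over the weighted sketch centroids."""
    # Assumed initialization: start from the k heaviest sketch centroids.
    centers = [c for c, _ in sorted(sketch, key=lambda cw: -cw[1])[:k]]
    for _ in range(iters):
        sums = [[0.0] * len(centers[0]) for _ in centers]
        mass = [0.0] * len(centers)
        for c, w in sketch:
            j = nearest(centers, c)
            mass[j] += w
            sums[j] = [s + w * x for s, x in zip(sums[j], c)]
        # Empty clusters keep their previous center.
        centers = [[s / m for s in sm] if m > 0 else old
                   for sm, m, old in zip(sums, mass, centers)]
    return centers
```

Usage would look something like sketch = streaming_kmeans(points,
max_sketch=20) followed by weighted_kmeans(sketch, k).  The second stage
only ever sees the sketch, which is why it fits in memory even when the
original data does not.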

On Tue, Sep 17, 2013 at 12:06 PM, Mike Hugo <[email protected]> wrote:

> Hello,
>
> I'm new to mahout but have been working with Solr, Carrot2 and clustering
> documents with the Lingo algorithm.  This has worked well for us for
> clustering small sets of search results, but we are now branching out into
> wanting to cluster larger sets of documents (millions of documents to 10s
> of millions of documents for now).
>
> Could someone point me in the right direction as to which of the clustering
> algorithms I should take a look at first (that would be similar to Lingo)?
>
> Thanks,
>
> Mike
>
