Right now, the best option in Mahout for speed without sacrificing quality is the streaming k-means implementation.
One exciting possibility is that you can probably combine a streaming k-means pre-pass with a regularized k-means algorithm to get results more like Lingo. You could also follow it with a DP-means pass to get an idea of the optimal number of clusters.

The idea with streaming k-means is that a first pass does a rough clustering into a large number of clusters. This pass is fast because only approximate nearest-neighbor search is needed. It is also adaptive, so you only have to specify very roughly how many clusters you will probably want later. The output is an approximate k-means clustering with many more clusters than you asked for.

This output can then be clustered in memory using any weighted clustering algorithm you care to use. For k-means and certain kinds of data, you can even get nice probabilistic accuracy bounds for the combination.

On Tue, Sep 17, 2013 at 12:06 PM, Mike Hugo <[email protected]> wrote:

> Hello,
>
> I'm new to mahout but have been working with Solr, Carrot2 and clustering
> documents with the Lingo algorithm. This has worked well for us for
> clustering small sets of search results, but we are now branching out into
> wanting to cluster larger sets of documents (millions of documents to 10s
> of millions of documents for now).
>
> Could someone point me in the right direction as to which of the clustering
> algorithms I should take a look at first (that would be similar to Lingo)?
>
> Thanks,
>
> Mike
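
To make the two-stage idea in the reply concrete, here is a rough Python sketch. It is not Mahout's code or API: the function names (`streaming_pass`, `weighted_kmeans`), the parameters (`max_sketch_size`, `initial_cutoff`, `growth`), and the cutoff-growth rule are all assumptions chosen for illustration. It also uses an exact nearest-centroid search where a real implementation would use approximate search for speed.

```python
import random


def dist2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _absorb(sketch, point, weight, cutoff, rng):
    """Add a weighted point: either start a new centroid or merge into the nearest."""
    if not sketch:
        sketch.append([point, weight])
        return
    # Exact search here; a production version would use approximate search.
    nearest = min(sketch, key=lambda cw: dist2(cw[0], point))
    d2 = dist2(nearest[0], point)
    if rng.random() < min(1.0, weight * d2 / cutoff):
        # Far from every centroid (relative to the cutoff): open a new cluster.
        sketch.append([point, weight])
    else:
        # Close enough: fold the point into the nearest centroid.
        c, w = nearest
        total = w + weight
        nearest[0] = [(ci * w + pi * weight) / total for ci, pi in zip(c, point)]
        nearest[1] = total


def streaming_pass(points, max_sketch_size, initial_cutoff=1e-3, growth=1.5, seed=0):
    """Single pass over the data; returns a list of [centroid, weight] pairs."""
    rng = random.Random(seed)
    cutoff = initial_cutoff
    sketch = []
    for p in points:
        _absorb(sketch, list(p), 1.0, cutoff, rng)
        # Adaptive part: when the sketch grows too large, raise the cutoff
        # and re-cluster the sketch's own centroids until it fits again.
        while len(sketch) > max_sketch_size:
            cutoff *= growth
            old, sketch = sketch, []
            for c, w in old:
                _absorb(sketch, c, w, cutoff, rng)
    return sketch


def weighted_kmeans(sketch, k, iterations=20, seed=0):
    """Plain Lloyd iterations over the weighted sketch centroids, in memory."""
    rng = random.Random(seed)
    centers = [list(c) for c, _ in rng.sample(sketch, k)]
    dim = len(centers[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in range(k)]
        totals = [0.0] * k
        for c, w in sketch:
            j = min(range(k), key=lambda i: dist2(centers[i], c))
            totals[j] += w
            for d in range(dim):
                sums[j][d] += w * c[d]
        for j in range(k):
            if totals[j] > 0:  # keep the old center if nothing was assigned
                centers[j] = [s / totals[j] for s in sums[j]]
    return centers
```

Calling `streaming_pass(points, max_sketch_size=20)` and then `weighted_kmeans(sketch, k=2)` mirrors the flow described in the reply: a cheap pass that yields many more weighted centroids than you ultimately want, followed by an in-memory weighted clustering of that small sketch. The second stage could just as well be any other weighted clustering algorithm.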
