Thanks Ted!
On Tue, Sep 17, 2013 at 2:59 PM, Ted Dunning wrote:

> Right now the best in terms of speed without losing quality in Mahout is
> the streaming k-means implementation.
>
> One exciting possibility is that you probably can combine a streaming
> k-means pre-pass with a regularized k-means algorithm in order to get
> results more like Lingo. You could also follow with a DP-means pass to get
> an idea of the optimal number of clusters.
>
> The idea with streaming k-means is that a first pass does a rough
> clustering into a whole lot of clusters. This pass is fast because only
> approximate search is needed. It is also adaptive so you only have to
> specify very roughly how many clusters you will probably be interested in
> having later. The output is an approximate k-means clustering with many
> more clusters than you asked for. This output can then be clustered in
> memory using any weighted clustering algorithm you care to use. For
> k-means and certain kinds of data, you can even get nice probabilistic
> accuracy bounds for the combo.
>
> On Tue, Sep 17, 2013 at 12:06 PM, Mike Hugo wrote:
>
> > Hello,
> >
> > I'm new to Mahout but have been working with Solr, Carrot2 and clustering
> > documents with the Lingo algorithm. This has worked well for us for
> > clustering small sets of search results, but we are now branching out into
> > wanting to cluster larger sets of documents (millions of documents to 10s
> > of millions of documents for now).
> >
> > Could someone point me in the right direction as to which of the clustering
> > algorithms I should take a look at first (that would be similar to Lingo)?
> >
> > Thanks,
> >
> > Mike
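
To make the two-stage idea in Ted's reply concrete, here is a minimal Python sketch. It is not Mahout's implementation: the function names (`streaming_prepass`, `weighted_kmeans`), the fixed `distance_cutoff`, and the toy data are all assumptions made for illustration. In particular, it uses exact nearest-centroid search and a fixed cutoff, whereas the approach described in the thread relies on approximate search and adapts as it goes.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def streaming_prepass(points, distance_cutoff, rng):
    """Single pass that builds a rough sketch of many weighted centroids.

    Assumption: a far-away point starts a new centroid with probability
    proportional to its squared distance; otherwise it is folded into the
    nearest centroid. The cutoff is fixed here (no adaptation).
    """
    centroids = []  # each entry is [vector, weight]
    for p in points:
        if not centroids:
            centroids.append([list(p), 1.0])
            continue
        # Exact nearest search for simplicity; a real system would approximate.
        best = min(centroids, key=lambda c: sq_dist(c[0], p))
        d = sq_dist(best[0], p)
        if rng.random() < d / distance_cutoff:
            centroids.append([list(p), 1.0])
        else:
            w = best[1]
            # Weighted running mean of the centroid and the new point.
            best[0] = [(w * c + x) / (w + 1.0) for c, x in zip(best[0], p)]
            best[1] = w + 1.0
    return centroids


def weighted_kmeans(centroids, k, iterations, rng):
    """In-memory Lloyd iterations over the weighted sketch centroids."""
    vectors = [c[0] for c in centroids]
    weights = [c[1] for c in centroids]
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        sums = [[0.0] * len(vectors[0]) for _ in range(k)]
        totals = [0.0] * k
        for v, w in zip(vectors, weights):
            j = min(range(k), key=lambda i: sq_dist(centers[i], v))
            totals[j] += w
            for dim, x in enumerate(v):
                sums[j][dim] += w * x
        for j in range(k):
            if totals[j] > 0:  # leave an empty cluster's center unchanged
                centers[j] = [s / totals[j] for s in sums[j]]
    return centers


# Toy data (hypothetical): three 2-D blobs standing in for document vectors.
rng = random.Random(0)
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in [(0, 0), (5, 5), (0, 5)]
          for _ in range(200)]
rng.shuffle(points)

sketch = streaming_prepass(points, distance_cutoff=1.0, rng=rng)
centers = weighted_kmeans(sketch, k=3, iterations=10, rng=rng)
print(len(points), "points ->", len(sketch), "sketch centroids ->", len(centers), "clusters")
```

The point of the design is that the second stage only ever sees the sketch, so its cost depends on the number of sketch centroids rather than on the number of documents. Any weighted clustering algorithm could be substituted for `weighted_kmeans` at that step.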
