Most of the clustering drivers have two methods: one to train the clusterer with data to produce the cluster models; one to classify the data using a given set of cluster models. Currently the CLI only allows train followed by optional classify. We could pretty easily allow classify to be done stand-alone, and this would be useful in support of Grant's approach below.
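To make the train/classify split concrete, here is a rough sketch in plain Python. It is not Mahout code: `train`, `classify`, and the toy data are hypothetical names invented for this illustration, and plain k-means with Euclidean distance stands in for whichever clusterer is actually used. The point is only that a stand-alone classify step needs nothing but the cluster models from the last full run.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(points, centers):
    """Assign each point to the index of its closest center (no training)."""
    return [min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            for p in points]


def train(points, k, iterations=10, seed=0):
    """Plain k-means: returns k cluster centers (the 'cluster models')."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        labels = classify(points, centers)
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a cluster went empty
                centers[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers


# Periodic full run over the whole collection (toy 2-D "documents").
collection = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centers = train(collection, k=2)

# Documents arriving between full runs are only classified, not trained on.
new_docs = [[0.5, 0.5], [10.5, 10.5]]
new_labels = classify(new_docs, centers)
```

Here `classify` never touches the training data, so it could be run on its own against saved centers whenever new documents arrive.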
Jeff

-----Original Message-----
From: Grant Ingersoll [mailto:[email protected]]
Sent: Thursday, May 12, 2011 3:32 AM
To: [email protected]
Subject: Re: AW: Incremental clustering

From what I've seen, using Mahout's existing clustering methods, I think most
people set up some schedule whereby they cluster the whole collection on a
regular basis and then all docs that come in the meantime are simply assigned
to the closest cluster until the next whole collection iteration is completed.
There are, of course, other variants one could do, such as kicking off the whole
clustering when some threshold of number of docs is reached.

There are other clustering methods, as Benson alluded to, that may better
support incremental approaches.

On May 12, 2011, at 4:53 AM, David Saile wrote:

> I am still stuck at this problem.
>
> Can anyone give me a heads-up on how existing systems handle this?
> If a collection of documents is modified, is the clustering recomputed from
> scratch each time?
> Or is there in fact any incremental way to handle an evolving set of
> documents?
>
> I would really appreciate any hint!
>
> Thanks,
> David
>
>
> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
>
>> Not an answer, but a follow-up question:
>> I would be interested in the very same thing, but with the possibility to
>> assign new sites to existing clusters OR to new ones.
>>
>> Thanks in advance,
>> Ulrich
>>
>> -----Original Message-----
>> From: David Saile [mailto:[email protected]]
>> Sent: Monday, May 9, 2011 11:53
>> To: [email protected]
>> Subject: Incremental clustering
>>
>> Hi list,
>>
>> I am completely new to Mahout, so please forgive me if the answer to my
>> question is too obvious.
>>
>> For a case study, I am working on a simple incremental web crawler (much
>> like Nutch) and I want to include a very simple indexing step that
>> incorporates clustering of documents.
>>
>> I was hoping to use some kind of incremental clustering algorithm, in order
>> to make use of the incremental way the crawler is supposed to work (i.e.
>> continuously adding and updating websites).
>>
>> Is there some way to achieve the following:
>> 1) initial clustering of the first web-crawl
>> 2) assigning new sites to existing clusters
>> 3) possibly moving modified sites between clusters
>>
>> I would really appreciate any help!
>>
>> Thanks,
>> David
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
