Most of these algorithms can be run incrementally: you train once, then fold each new batch of data into the previously trained model instead of retraining from scratch.
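
To make "adding a batch to the previous training" concrete, here is a minimal Python sketch of a sequential (online) k-means update. This is not Mahout code: the function names `nearest` and `update_centroids` and the running-count scheme are assumptions chosen for illustration, and other clustering algorithms would need a different update rule.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def update_centroids(centroids, counts, batch):
    """Fold a new batch into existing centroids without revisiting old data.

    centroids: list of coordinate lists from the previous training (hypothetical layout)
    counts:    how many points each centroid has absorbed so far
    Both lists are updated in place and also returned.
    """
    for point in batch:
        i = nearest(point, centroids)
        counts[i] += 1
        rate = 1.0 / counts[i]  # running-mean step size: shrinks as the cluster grows
        centroids[i] = [c + rate * (x - c) for c, x in zip(centroids[i], point)]
    return centroids, counts
```

Because each centroid is kept as a running mean, the previous model plus the per-cluster counts is all the state that has to be carried from one batch to the next.
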
On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <[email protected]> wrote:

> Most of the clustering drivers have two methods: one to train the clusterer
> with data to produce the cluster models; one to classify the data using a
> given set of cluster models. Currently the CLI only allows train followed by
> optional classify. We could pretty easily allow classify to be done
> stand-alone, and this would be useful in support of Grant's approach below.
>
> Jeff
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[email protected]]
> Sent: Thursday, May 12, 2011 3:32 AM
> To: [email protected]
> Subject: Re: AW: Incremental clustering
>
> From what I've seen, using Mahout's existing clustering methods, I think
> most people set up some schedule whereby they cluster the whole collection on
> a regular basis and then all docs that come in the meantime are simply
> assigned to the closest cluster until the next whole collection iteration is
> completed. There are, of course, other variants one could do, such as kick
> off the whole clustering when some threshold of number of docs is reached.
>
> There are other clustering methods, as Benson alluded to, that may better
> support incremental approaches.
>
> On May 12, 2011, at 4:53 AM, David Saile wrote:
>
> > I am still stuck at this problem.
> >
> > Can anyone give me a heads-up on how existing systems handle this?
> > If a collection of documents is modified, is the clustering recomputed
> > from scratch each time?
> > Or is there in fact any incremental way to handle an evolving set of
> > documents?
> >
> > I would really appreciate any hint!
> >
> > Thanks,
> > David
> >
> >
> > On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
> >
> >> Not an answer, but a follow-up question:
> >> I would be interested in the very same thing, but with the possibility
> >> to assign new sites to existing clusters OR to new ones.
> >>
> >> Thanks in advance,
> >> Ulrich
> >>
> >> -----Original Message-----
> >> From: David Saile [mailto:[email protected]]
> >> Sent: Monday, May 9, 2011 11:53
> >> To: [email protected]
> >> Subject: Incremental clustering
> >>
> >> Hi list,
> >>
> >> I am completely new to Mahout, so please forgive me if the answer to my
> >> question is too obvious.
> >>
> >> For a case study, I am working on a simple incremental web crawler (much
> >> like Nutch) and I want to include a very simple indexing step that
> >> incorporates clustering of documents.
> >>
> >> I was hoping to use some kind of incremental clustering algorithm, in
> >> order to make use of the incremental way the crawler is supposed to work
> >> (i.e. continuously adding and updating websites).
> >>
> >> Is there some way to achieve the following:
> >> 1) initial clustering of the first web-crawl
> >> 2) assigning new sites to existing clusters
> >> 3) possibly moving modified sites between clusters
> >>
> >> I would really appreciate any help!
> >>
> >> Thanks,
> >> David
> >
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
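
The schedule Grant describes in the quoted thread (full clustering from time to time, nearest-cluster assignment in between, with a document-count threshold as one possible trigger) could be sketched roughly as follows. This is an illustration in Python, not Mahout's API: the class name `ScheduledClusterer`, the `recluster` callable, and the `threshold` parameter are all hypothetical, and `recluster` is assumed to return at least one centroid.

```python
import math

class ScheduledClusterer:
    """Assign incoming docs to the nearest existing cluster; recluster on a threshold."""

    def __init__(self, recluster, threshold):
        self.recluster = recluster  # hypothetical callable: list of vectors -> list of centroids
        self.threshold = threshold  # number of new docs that triggers a full recluster
        self.docs = []              # every vector seen so far
        self.centroids = []         # cluster models from the last full run
        self.pending = 0            # docs added since the last full run

    def rebuild(self):
        """Run the full clustering over the whole collection."""
        self.centroids = self.recluster(self.docs)
        self.pending = 0

    def add(self, vector):
        """Store a new doc and return the index of its closest cluster."""
        self.docs.append(vector)
        self.pending += 1
        # Full run on the very first doc, or once enough new docs have piled up.
        if not self.centroids or self.pending >= self.threshold:
            self.rebuild()
        return min(range(len(self.centroids)),
                   key=lambda i: math.dist(vector, self.centroids[i]))
```

A time-based schedule would work the same way, with `rebuild()` called from a timer instead of from the counter check in `add()`.
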
