Using whatever you used originally would be best. A map-reduce program will be slow for small batches, of course. I don't know if seq2sparse has an efficient sequential mode.
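If you only need to vectorize a handful of new documents in-process, something along the following lines might do: read the dictionary that seq2sparse already wrote and build term-frequency vectors directly. This is only a rough sketch; the dictionary path and the naive whitespace tokenization are assumptions on my part, and seq2sparse itself runs the text through a Lucene analyzer, so to stay consistent with the original vectors you would want to reuse that analyzer instead.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class QuickVectorizer {

  // Load the term -> index mapping from seq2sparse's dictionary output
  // (a SequenceFile of Text/IntWritable, e.g. <output>/dictionary.file-0).
  static Map<String, Integer> loadDictionary(Configuration conf, Path dictPath) throws Exception {
    Map<String, Integer> dict = new HashMap<String, Integer>();
    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), dictPath, conf);
    Text term = new Text();
    IntWritable index = new IntWritable();
    while (reader.next(term, index)) {
      dict.put(term.toString(), index.get());
    }
    reader.close();
    return dict;
  }

  // Build a plain term-frequency vector for one new document.
  // Terms not present in the existing dictionary are silently dropped.
  static Vector vectorize(String doc, Map<String, Integer> dict) {
    Vector vector = new RandomAccessSparseVector(dict.size());
    for (String token : doc.toLowerCase().split("\\s+")) {   // naive tokenization, see caveat above
      Integer idx = dict.get(token);
      if (idx != null) {
        vector.set(idx, vector.get(idx) + 1.0);
      }
    }
    return vector;
  }
}

If you want TF-IDF weights rather than raw counts you would also have to pull in the document-frequency output that seq2sparse writes and reweight, which is where reusing the original pipeline starts to look more attractive.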
On Thu, May 12, 2011 at 11:18 AM, Frank Scholten <[email protected]> wrote:
> What do you recommend for vectorizing the new docs? Run seq2sparse on
> a batch of them? Seems there's no code at the moment for quickly
> vectorizing a few new documents based on the existing dictionary.
>
> Frank
>
> On Thu, May 12, 2011 at 12:32 PM, Grant Ingersoll <[email protected]> wrote:
> > From what I've seen, using Mahout's existing clustering methods, I think
> > most people set up some schedule whereby they cluster the whole collection
> > on a regular basis, and all docs that come in in the meantime are simply
> > assigned to the closest cluster until the next whole-collection iteration
> > is completed. There are, of course, other variants one could do, such as
> > kicking off the whole clustering when some threshold number of docs is
> > reached.
> >
> > There are other clustering methods, as Benson alluded to, that may better
> > support incremental approaches.
> >
> > On May 12, 2011, at 4:53 AM, David Saile wrote:
> >
> >> I am still stuck at this problem.
> >>
> >> Can anyone give me a heads-up on how existing systems handle this?
> >> If a collection of documents is modified, is the clustering recomputed
> >> from scratch each time?
> >> Or is there in fact any incremental way to handle an evolving set of
> >> documents?
> >>
> >> I would really appreciate any hint!
> >>
> >> Thanks,
> >> David
> >>
> >>
> >> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
> >>
> >>> Not an answer, but a follow-up question:
> >>> I would be interested in the very same thing, but with the possibility
> >>> of assigning new sites to existing clusters OR to new ones.
> >>>
> >>> Thanks in advance,
> >>> Ulrich
> >>>
> >>> -----Original Message-----
> >>> From: David Saile [mailto:[email protected]]
> >>> Sent: Monday, 9 May 2011 11:53
> >>> To: [email protected]
> >>> Subject: Incremental clustering
> >>>
> >>> Hi list,
> >>>
> >>> I am completely new to Mahout, so please forgive me if the answer to my
> >>> question is too obvious.
> >>>
> >>> For a case study, I am working on a simple incremental web crawler
> >>> (much like Nutch) and I want to include a very simple indexing step that
> >>> incorporates clustering of documents.
> >>>
> >>> I was hoping to use some kind of incremental clustering algorithm, in
> >>> order to make use of the incremental way the crawler is supposed to work
> >>> (i.e. continuously adding and updating websites).
> >>>
> >>> Is there some way to achieve the following:
> >>> 1) initial clustering of the first web crawl
> >>> 2) assigning new sites to existing clusters
> >>> 3) possibly moving modified sites between clusters
> >>>
> >>> I would really appreciate any help!
> >>>
> >>> Thanks,
> >>> David
> >>
> >
> > --------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem docs using Solr/Lucene:
> > http://www.lucidimagination.com/search
> >
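P.S. For the "assign new docs to the closest cluster" step Grant describes further down, the in-memory part is just a nearest-centroid lookup. A minimal sketch, assuming the centroids have already been read out of the final k-means output (loading that SequenceFile is left out here) and using cosine distance purely as an example measure:

import java.util.List;

import org.apache.mahout.common.distance.CosineDistanceMeasure;
import org.apache.mahout.common.distance.DistanceMeasure;
import org.apache.mahout.math.Vector;

public class NearestClusterAssigner {

  private final List<Vector> centroids;                       // one vector per existing cluster
  private final DistanceMeasure measure = new CosineDistanceMeasure();

  public NearestClusterAssigner(List<Vector> centroids) {
    this.centroids = centroids;
  }

  // Return the index of the centroid closest to the given document vector.
  public int assign(Vector docVector) {
    int best = -1;
    double bestDistance = Double.MAX_VALUE;
    for (int i = 0; i < centroids.size(); i++) {
      double d = measure.distance(centroids.get(i), docVector);
      if (d < bestDistance) {
        bestDistance = d;
        best = i;
      }
    }
    return best;
  }
}

When to recluster the whole collection from scratch rather than keep reassigning then becomes a policy decision, e.g. on a schedule or once a threshold number of new docs has accumulated, as Grant suggests.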
