What do you recommend for vectorizing the new docs? Should I run seq2sparse on a batch of them? It seems there's no code at the moment for quickly vectorizing a few new documents against the existing dictionary.
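To make the question concrete, here is a minimal sketch of what I mean by vectorizing against the existing dictionary. It is plain Python, not Mahout code. The names `vectorize`, `dictionary`, `doc_freq` and `num_docs` are hypothetical, the sample values are made up, and the weighting is a textbook TF-IDF that may not match what seq2sparse computes.

```python
import math
from collections import Counter

def vectorize(tokens, dictionary, doc_freq, num_docs):
    """Return a sparse TF-IDF vector {term_index: weight} for one new document.

    Terms missing from the existing dictionary are dropped, so the result
    stays in the same vector space as the previously vectorized documents.
    """
    counts = Counter(t for t in tokens if t in dictionary)
    vector = {}
    for term, tf in counts.items():
        # Textbook IDF; an assumption, not necessarily Mahout's formula.
        idf = math.log(num_docs / doc_freq[term])
        vector[dictionary[term]] = tf * idf
    return vector

# Hypothetical dictionary and document frequencies from an earlier full run
# (loading them from disk is not shown).
dictionary = {"mahout": 0, "cluster": 1, "crawler": 2}
doc_freq = {"mahout": 10, "cluster": 50, "crawler": 5}
num_docs = 100

new_doc = "the crawler feeds mahout and mahout builds each cluster".split()
vec = vectorize(new_doc, dictionary, doc_freq, num_docs)
```

Because the dictionary and the document frequencies stay frozen, words that appear only in the new documents are ignored until the next full run rebuilds the dictionary.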
Frank

On Thu, May 12, 2011 at 12:32 PM, Grant Ingersoll <[email protected]> wrote:
> From what I've seen, using Mahout's existing clustering methods, I think most
> people set up some schedule whereby they cluster the whole collection on a
> regular basis, and all docs that come in the meantime are simply assigned
> to the closest cluster until the next whole-collection iteration is
> completed. There are, of course, other variants one could do, such as kicking
> off the whole clustering when some threshold of number of docs is reached.
>
> There are other clustering methods, as Benson alluded to, that may better
> support incremental approaches.
>
> On May 12, 2011, at 4:53 AM, David Saile wrote:
>
>> I am still stuck at this problem.
>>
>> Can anyone give me a heads-up on how existing systems handle this?
>> If a collection of documents is modified, is the clustering recomputed from
>> scratch each time?
>> Or is there in fact any incremental way to handle an evolving set of
>> documents?
>>
>> I would really appreciate any hint!
>>
>> Thanks,
>> David
>>
>>
>> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
>>
>>> Not an answer, but a follow-up question:
>>> I would be interested in the very same thing, but with the possibility to
>>> assign new sites to existing clusters OR to new ones.
>>>
>>> Thanks in advance,
>>> Ulrich
>>>
>>> -----Original Message-----
>>> From: David Saile [mailto:[email protected]]
>>> Sent: Monday, 9 May 2011 11:53
>>> To: [email protected]
>>> Subject: Incremental clustering
>>>
>>> Hi list,
>>>
>>> I am completely new to Mahout, so please forgive me if the answer to my
>>> question is too obvious.
>>>
>>> For a case study, I am working on a simple incremental web crawler (much
>>> like Nutch) and I want to include a very simple indexing step that
>>> incorporates clustering of documents.
>>>
>>> I was hoping to use some kind of incremental clustering algorithm, in order
>>> to make use of the incremental way the crawler is supposed to work (i.e.
>>> continuously adding and updating websites).
>>>
>>> Is there some way to achieve the following:
>>> 1) initial clustering of the first web-crawl
>>> 2) assigning new sites to existing clusters
>>> 3) possibly moving modified sites between clusters
>>>
>>> I would really appreciate any help!
>>>
>>> Thanks,
>>> David
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
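
For reference, the interim step Grant describes (assigning incoming docs to the closest existing cluster until the next full run) could look roughly like the sketch below. It is plain Python rather than Mahout code. It assumes documents and centroids are sparse vectors stored as {term_index: weight} dicts and that cosine distance is the measure in use. The names `cosine_distance`, `assign_to_closest` and the sample centroids are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two sparse vectors stored as {index: weight}."""
    dot = sum(w * b.get(i, 0.0) for i, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat an empty vector as maximally distant
    return 1.0 - dot / (norm_a * norm_b)

def assign_to_closest(vector, centroids):
    """Return the id of the centroid with the smallest cosine distance."""
    return min(centroids, key=lambda cid: cosine_distance(vector, centroids[cid]))

# Hypothetical centroids from the last whole-collection clustering run.
centroids = {
    "c0": {0: 1.0, 1: 0.5},
    "c1": {2: 2.0},
}

# A newly crawled document, already vectorized with the same dictionary.
doc = {1: 0.1, 2: 3.0}
closest = assign_to_closest(doc, centroids)
```

The centroids are left untouched here, so the assignment stays consistent between full runs. Whether a modified site moves to another cluster is decided by simply re-running the assignment on its new vector.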
