Jeff, could you expand a bit on the subject of models in clustering? I mentally simplify this to 'clustering: unsupervised; classification: supervised.'
Is the idea here that you are going to be presented with many different corpora that have some sort of overall resemblance, so that priors derived from the first N speed up clustering N+1?

--benson

On Thu, May 12, 2011 at 12:00 PM, Jeff Eastman <[email protected]> wrote:
> Sure, by using your old clusters as the prior (clustersIn) for the new
> clustering, you can reduce the number of iterations required to converge.
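For concreteness, that seeded re-run might look roughly like the sketch below, assuming the Mahout 0.5-era KMeansDriver.run signature (the parameter list has moved around between releases, and the paths here are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.mahout.clustering.kmeans.KMeansDriver;
    import org.apache.mahout.common.distance.CosineDistanceMeasure;

    public class SeededReclustering {
      public static void main(String[] args) throws Exception {
        // Document vectors from the latest crawl (hypothetical paths).
        Path input = new Path("crawl-2/vectors");
        // The final clusters of the previous run serve as the prior.
        Path clustersIn = new Path("crawl-1/kmeans/clusters-final");
        Path output = new Path("crawl-2/kmeans");

        // Seeding with the old clusters should converge in far fewer
        // iterations than a fresh run from random initial centroids.
        KMeansDriver.run(new Configuration(), input, clustersIn, output,
            new CosineDistanceMeasure(),
            0.01,   // convergenceDelta
            10,     // maxIterations: an upper bound, rarely hit with a good prior
            true,   // runClustering: classify the input against the final clusters
            false); // runSequential: use the MapReduce implementation
      }
    }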
> -----Original Message-----
> From: David Saile [mailto:[email protected]]
> Sent: Thursday, May 12, 2011 8:54 AM
> To: [email protected]
> Subject: Re: AW: Incremental clustering
>
> Thank you very much, everyone! This really helped a lot.
>
> Here is what I am planning to do:
> I am going to compute an initial clustering after the first crawl.
> Then, as sites are added to the index, I will simply classify them using
> the existing clusters.
>
> As I expect updates to generally be very small, I will only recompute the
> clustering after some threshold has been hit, as Grant suggested.
> As Ted pointed out, this can be done with the old clusters as input.
>
> Thanks again,
> David
>
> On May 12, 2011, at 5:35 PM, Ted Dunning wrote:
>
>> Most of these algorithms can be run in an incremental fashion, in which
>> you can add batches to the previous training.
>>
>> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <[email protected]> wrote:
>>
>>> Most of the clustering drivers have two methods: one to train the
>>> clusterer with data to produce the cluster models, and one to classify
>>> the data using a given set of cluster models. Currently the CLI only
>>> allows train followed by an optional classify. We could pretty easily
>>> allow classify to be run stand-alone, and this would be useful in
>>> support of Grant's approach below.
>>>
>>> Jeff
>>>
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:[email protected]]
>>> Sent: Thursday, May 12, 2011 3:32 AM
>>> To: [email protected]
>>> Subject: Re: AW: Incremental clustering
>>>
>>> From what I've seen of Mahout's existing clustering methods, most
>>> people set up a schedule whereby they cluster the whole collection on a
>>> regular basis, and any docs that arrive in the meantime are simply
>>> assigned to the closest cluster until the next whole-collection
>>> iteration completes. There are, of course, other variants one could
>>> try, such as kicking off the whole clustering when some threshold
>>> number of docs is reached.
>>>
>>> There are other clustering methods, as Benson alluded to, that may
>>> better support incremental approaches.
>>>
>>> On May 12, 2011, at 4:53 AM, David Saile wrote:
>>>
>>>> I am still stuck on this problem.
>>>>
>>>> Can anyone give me a heads-up on how existing systems handle this?
>>>> If a collection of documents is modified, is the clustering recomputed
>>>> from scratch each time?
>>>> Or is there in fact an incremental way to handle an evolving set of
>>>> documents?
>>>>
>>>> I would really appreciate any hint!
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>> On May 9, 2011, at 12:45 PM, Ulrich Poppendieck wrote:
>>>>
>>>>> Not an answer, but a follow-up question:
>>>>> I would be interested in the very same thing, but with the option of
>>>>> assigning new sites either to existing clusters or to new ones.
>>>>>
>>>>> Thanks in advance,
>>>>> Ulrich
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Saile [mailto:[email protected]]
>>>>> Sent: Monday, May 9, 2011 11:53 AM
>>>>> To: [email protected]
>>>>> Subject: Incremental clustering
>>>>>
>>>>> Hi list,
>>>>>
>>>>> I am completely new to Mahout, so please forgive me if the answer to
>>>>> my question is too obvious.
>>>>>
>>>>> For a case study, I am working on a simple incremental web crawler
>>>>> (much like Nutch), and I want to include a very simple indexing step
>>>>> that incorporates clustering of documents.
>>>>>
>>>>> I was hoping to use some kind of incremental clustering algorithm, in
>>>>> order to match the incremental way the crawler is supposed to work
>>>>> (i.e., continuously adding and updating websites).
>>>>>
>>>>> Is there some way to achieve the following:
>>>>> 1) initial clustering of the first web crawl
>>>>> 2) assigning new sites to existing clusters
>>>>> 3) possibly moving modified sites between clusters
>>>>>
>>>>> I would really appreciate any help!
>>>>>
>>>>> Thanks,
>>>>> David
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem docs using Solr/Lucene:
>>> http://www.lucidimagination.com/search
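The assign-to-the-closest-cluster step Grant describes above is just a nearest-centroid search over the trained cluster models. A minimal sketch, assuming the centers from a clusters-final directory have already been loaded into memory, and that Cluster#getId(), Cluster#getCenter(), and DistanceMeasure#distance() match the 0.5-era interfaces (verify the names against your release):

    import java.util.List;

    import org.apache.mahout.clustering.Cluster;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.math.Vector;

    public class NearestClusterAssigner {

      private final List<Cluster> clusters;  // e.g. loaded from clusters-final
      private final DistanceMeasure measure; // the measure used for training

      public NearestClusterAssigner(List<Cluster> clusters, DistanceMeasure measure) {
        this.clusters = clusters;
        this.measure = measure;
      }

      /** Returns the id of the cluster whose center is closest to the document. */
      public int assign(Vector document) {
        int bestId = -1;
        double bestDistance = Double.MAX_VALUE;
        for (Cluster cluster : clusters) {
          double d = measure.distance(cluster.getCenter(), document);
          if (d < bestDistance) {
            bestDistance = d;
            bestId = cluster.getId();
          }
        }
        return bestId;
      }
    }

Using the same DistanceMeasure that trained the clusters keeps these assignments consistent with the clustering itself.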

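Finally, the schedule David settles on (classify cheaply per document, recompute the clustering in batch once a threshold of new documents is hit) ties the two sketches together. The threshold value and both private helpers below are hypothetical placeholders:

    import org.apache.mahout.math.Vector;

    public class IncrementalClusteringPolicy {

      private static final int RECLUSTER_THRESHOLD = 10000; // tune for your crawl

      private final NearestClusterAssigner assigner; // see the sketch above
      private int docsSinceLastClustering = 0;

      public IncrementalClusteringPolicy(NearestClusterAssigner assigner) {
        this.assigner = assigner;
      }

      public void onDocumentIndexed(Vector document) throws Exception {
        // Cheap per-document step: classify against the existing clusters.
        int clusterId = assigner.assign(document);
        recordAssignment(document, clusterId);

        // Expensive batched step: once enough new documents have accumulated,
        // recluster the whole collection, seeded with the old clusters.
        if (++docsSinceLastClustering >= RECLUSTER_THRESHOLD) {
          reclusterSeededWithOldClusters();
          docsSinceLastClustering = 0;
        }
      }

      private void recordAssignment(Vector document, int clusterId) {
        // Hypothetical: record the document-to-cluster assignment in the index.
      }

      private void reclusterSeededWithOldClusters() throws Exception {
        // Hypothetical: re-run KMeansDriver with clustersIn pointing at the
        // previous clusters-final directory, then reload the assigner's models.
      }
    }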