Thank you very much everyone! This really helped a lot.

Here is what I am planning to do:
I am going to compute an initial clustering after the first crawl.
Then, as sites are added to the index, I will simply classify them using
the existing clusters.

As I expect updates to be generally very small, I will only recompute the
clustering after some threshold has been hit, as Grant suggested.
As Ted pointed out, this can be done with the old clusters as input.
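
A minimal sketch of this plan, in plain Python rather than Mahout. Everything in it is an assumption made for illustration: the names `IncrementalClusterer`, `kmeans`, and `nearest` are hypothetical, documents are taken to be numeric vectors already, the first k points serve as initial seeds, and plain k-means with Euclidean distance stands in for whichever clustering algorithm is actually used.

```python
import math


def nearest(point, centroids):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd-style k-means, started from the given centroids."""
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group;
        # keep the old centroid if no point was assigned to it.
        centroids = [
            [sum(dim) / len(g) for dim in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


class IncrementalClusterer:
    """Hypothetical helper: classify new points, re-cluster after a threshold."""

    def __init__(self, initial_points, k, threshold):
        self.points = list(initial_points)
        self.threshold = threshold
        self.pending = 0  # points added since the last full clustering
        # Initial clustering after the first crawl (seeds: first k points).
        self.centroids = kmeans(self.points, [list(p) for p in self.points[:k]])

    def add(self, point):
        """Add a point and return the index of the cluster it is assigned to."""
        self.points.append(point)
        self.pending += 1
        if self.pending >= self.threshold:
            # Threshold hit: re-cluster everything, seeded with the old centroids.
            self.centroids = kmeans(self.points, self.centroids)
            self.pending = 0
        return nearest(point, self.centroids)
```

Seeding the re-clustering with the old centroids should let it converge quickly when the updates are small, and it keeps cluster indices stable between runs.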

Thanks again,
David


 
On 12.05.2011 at 17:35, Ted Dunning wrote:

> Most of these algorithms can be done in an incremental fashion in which you
> can add batches to the previous training.
> 
> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <[email protected]> wrote:
> 
>> Most of the clustering drivers have two methods: one to train the clusterer
>> with data to produce the cluster models; one to classify the data using a
>> given set of cluster models. Currently the CLI only allows train followed by
>> optional classify. We could pretty easily allow classify to be done
>> stand-alone, and this would be useful in support of Grant's approach below.
>> 
>> Jeff
>> 
>> -----Original Message-----
>> From: Grant Ingersoll [mailto:[email protected]]
>> Sent: Thursday, May 12, 2011 3:32 AM
>> To: [email protected]
>> Subject: Re: AW: Incremental clustering
>> 
>> From what I've seen, using Mahout's existing clustering methods, I think
>> most people set up some schedule whereby they cluster the whole collection on
>> a regular basis and then all docs that come in the meantime are simply
>> assigned to the closest cluster until the next whole collection iteration is
>> completed.  There are, of course, other variants one could do, such as kicking
>> off the whole clustering when some threshold of number of docs is reached.
>> 
>> There are other clustering methods, as Benson alluded to, that may better
>> support incremental approaches.
>> 
>> On May 12, 2011, at 4:53 AM, David Saile wrote:
>> 
>>> I am still stuck at this problem.
>>> 
>>> Can anyone give me a heads-up on how existing systems handle this?
>>> If a collection of documents is modified, is the clustering recomputed
>>> from scratch each time?
>>> Or is there in fact any incremental way to handle an evolving set of
>>> documents?
>>> 
>>> I would really appreciate any hint!
>>> 
>>> Thanks,
>>> David
>>> 
>>> 
>>> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
>>> 
>>>> Not an answer, but a follow-up question:
>>>> I would be interested in the very same thing, but with the possibility
>>>> to assign new sites to existing clusters OR to new ones.
>>>> 
>>>> Thanks in advance,
>>>> Ulrich
>>>> 
>>>> -----Original Message-----
>>>> From: David Saile [mailto:[email protected]]
>>>> Sent: Monday, May 9, 2011 11:53
>>>> To: [email protected]
>>>> Subject: Incremental clustering
>>>> 
>>>> Hi list,
>>>> 
>>>> I am completely new to Mahout, so please forgive me if the answer to my
>>>> question is too obvious.
>>>> 
>>>> For a case study, I am working on a simple incremental web crawler (much
>>>> like Nutch) and I want to include a very simple indexing step that
>>>> incorporates clustering of documents.
>>>> 
>>>> I was hoping to use some kind of incremental clustering algorithm, in
>>>> order to make use of the incremental way the crawler is supposed to work
>>>> (i.e. continuously adding and updating websites).
>>>> 
>>>> Is there some way to achieve the following:
>>>>     1) initial clustering of the first web-crawl
>>>>     2) assigning new sites to existing clusters
>>>>     3) possibly moving modified sites between clusters
>>>> 
>>>> I would really appreciate any help!
>>>> 
>>>> Thanks,
>>>> David
>>> 
>> 
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>> 
>> Search the Lucene ecosystem docs using Solr/Lucene:
>> http://www.lucidimagination.com/search
>> 
>> 
