Most of the clustering drivers have two methods: one to train the clusterer 
with data to produce the cluster models; one to classify the data using a given 
set of cluster models. Currently the CLI only allows train followed by optional 
classify. We could pretty easily allow classify to be done stand-alone, and 
this would be useful in support of Grant's approach below.

Jeff

-----Original Message-----
From: Grant Ingersoll [mailto:[email protected]] 
Sent: Thursday, May 12, 2011 3:32 AM
To: [email protected]
Subject: Re: AW: Incremental clustering

From what I've seen, using Mahout's existing clustering methods, I think most
people set up a schedule whereby they cluster the whole collection on a
regular basis, and all docs that come in in the meantime are simply assigned
to the closest cluster until the next whole-collection iteration is completed.
There are, of course, other variants one could do, such as kicking off the
whole clustering when some threshold number of docs is reached.
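
As a rough illustration of that scheme (not Mahout code), here is a minimal
Python sketch. It assumes documents are already dense numeric vectors and uses
Euclidean distance; the names closest_cluster, IncrementalAssigner and
recluster_threshold are hypothetical, and the full re-clustering job itself is
left out and assumed to run elsewhere.

```python
import math


def closest_cluster(doc, centroids):
    """Return the index of the centroid nearest to doc (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(doc, centroids[i]))


class IncrementalAssigner:
    """Assigns incoming docs to existing clusters between full re-clusterings.

    Hypothetical helper for illustration only; it is not part of Mahout.
    """

    def __init__(self, centroids, recluster_threshold):
        self.centroids = centroids                      # from the last full run
        self.recluster_threshold = recluster_threshold  # assumed doc-count trigger
        self.pending = []                               # docs seen since that run

    def add(self, doc):
        """Record the doc and return the index of its closest existing cluster."""
        self.pending.append(doc)
        return closest_cluster(doc, self.centroids)

    def needs_recluster(self):
        """True once enough new docs have arrived to justify a full re-run."""
        return len(self.pending) >= self.recluster_threshold

    def install(self, new_centroids):
        """Swap in centroids from a completed full run and reset the counter."""
        self.centroids = new_centroids
        self.pending.clear()
```

A caller would invoke add() for each new or modified document, and start the
full clustering job whenever needs_recluster() returns True (or on a fixed
schedule), calling install() once the new cluster models are ready.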

There are other clustering methods, as Benson alluded to, that may better 
support incremental approaches.

On May 12, 2011, at 4:53 AM, David Saile wrote:

> I am still stuck at this problem.
> 
> Can anyone give me a heads-up on how existing systems handle this? 
> If a collection of documents is modified, is the clustering recomputed from 
> scratch each time? 
> Or is there in fact any incremental way to handle an evolving set of 
> documents?
> 
> I would really appreciate any hint!
> 
> Thanks,
> David
> 
> 
> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
> 
>> Not an answer, but a follow-up question: 
>> I would be interested in the very same thing, but with the possibility to 
>> assign new sites to existing clusters OR to new ones.
>> 
>> Thanks in advance,
>> Ulrich
>> 
>> -----Original Message-----
>> From: David Saile [mailto:[email protected]] 
>> Sent: Monday, May 9, 2011 11:53
>> To: [email protected]
>> Subject: Incremental clustering
>> 
>> Hi list,
>> 
>> I am completely new to Mahout, so please forgive me if the answer to my 
>> question is too obvious.
>> 
>> For a case study, I am working on a simple incremental web crawler (much 
>> like Nutch) and I want to include a very simple indexing step that 
>> incorporates clustering of documents.
>> 
>> I was hoping to use some kind of incremental clustering algorithm, in order 
>> to make use of the incremental way the crawler is supposed to work (i.e. 
>> continuously adding and updating websites).
>> 
>> Is there some way to achieve the following:  
>>      1) initial clustering of the first web-crawl
>>      2) assigning new sites to existing clusters
>>      3) possibly moving modified sites between clusters
>> 
>> I would really appreciate any help!
>> 
>> Thanks,
>> David
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
