What do you recommend for vectorizing the new docs? Run seq2sparse on
a batch of them? Seems there's no code at the moment for quickly
vectorizing a few new documents based on the existing dictionary.
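For a handful of new documents, the core operation is small enough to sketch by hand: load the term-to-index dictionary that the earlier full seq2sparse run produced, then map each new document onto a sparse vector, dropping terms the dictionary has never seen. A rough illustration of the idea (plain Python with a dict standing in for Mahout's dictionary SequenceFile, and naive whitespace tokenization standing in for the real analyzer):

```python
from collections import Counter

def vectorize(text, dictionary):
    """Map a document onto an existing vocabulary as {term_index: tf}.
    Terms absent from the dictionary are simply dropped, so the vector
    stays compatible with the clusters built from the old collection."""
    counts = Counter(text.lower().split())
    return {dictionary[t]: c for t, c in counts.items() if t in dictionary}

# term -> integer index, as produced by an earlier full vectorization run
dictionary = {"web": 0, "crawler": 1, "cluster": 2}
print(vectorize("cluster the web crawler", dictionary))
```

The important property is that the dictionary is frozen between full runs, so new vectors live in the same space as the existing centroids; genuinely new vocabulary only gets picked up at the next full seq2sparse pass.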

Frank

On Thu, May 12, 2011 at 12:32 PM, Grant Ingersoll <[email protected]> wrote:
> From what I've seen, using Mahout's existing clustering methods, most 
> people set up a schedule whereby they re-cluster the whole collection on a 
> regular basis, and all docs that arrive in the meantime are simply assigned 
> to the closest cluster until the next whole-collection run is 
> completed.  There are, of course, other variants one could do, such as 
> kicking off the full clustering once some threshold number of new docs is reached.
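The interim assignment step described above is just a nearest-centroid lookup. A minimal sketch with dense vectors and cosine similarity (the centroid values are made up for illustration; this is not Mahout's classifier API):

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def assign(doc, centroids):
    """Return the id of the centroid most similar to the document vector."""
    return max(centroids, key=lambda cid: cosine(doc, centroids[cid]))

centroids = {"sports": [0.9, 0.1, 0.0], "tech": [0.1, 0.8, 0.3]}
print(assign([0.2, 0.7, 0.4], centroids))  # prints: tech
```

Assignments made this way drift as the collection evolves, which is exactly why the periodic whole-collection re-clustering is still needed.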
>
> There are other clustering methods, as Benson alluded to, that may better 
> support incremental approaches.
>
> On May 12, 2011, at 4:53 AM, David Saile wrote:
>
>> I am still stuck at this problem.
>>
>> Can anyone give me a heads-up on how existing systems handle this?
>> If a collection of documents is modified, is the clustering recomputed from 
>> scratch each time?
>> Or is there in fact any incremental way to handle an evolving set of 
>> documents?
>>
>> I would really appreciate any hint!
>>
>> Thanks,
>> David
>>
>>
>> On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
>>
>>> Not an answer, but a follow-up question:
>>> I would be interested in the very same thing, but with the possibility of 
>>> assigning new sites to existing clusters OR to new ones.
>>>
>>> Thanks in advance,
>>> Ulrich
>>>
>>> -----Original Message-----
>>> From: David Saile [mailto:[email protected]]
>>> Sent: Monday, 9 May 2011 11:53
>>> To: [email protected]
>>> Subject: Incremental clustering
>>>
>>> Hi list,
>>>
>>> I am completely new to Mahout, so please forgive me if the answer to my 
>>> question is too obvious.
>>>
>>> For a case study, I am working on a simple incremental web crawler (much 
>>> like Nutch) and I want to include a very simple indexing step that 
>>> incorporates clustering of documents.
>>>
>>> I was hoping to use some kind of incremental clustering algorithm, in order 
>>> to make use of the incremental way the crawler is supposed to work (i.e. 
>>> continuously adding and updating websites).
>>>
>>> Is there some way to achieve the following:
>>>      1) initial clustering of the first web-crawl
>>>      2) assigning new sites to existing clusters
>>>      3) possibly moving modified sites between clusters
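Step 3 in the list above (moving a modified site) amounts to removing the document's contribution from its old cluster and adding it to the new one. A rough sketch of that incremental bookkeeping, assuming centroids are kept as running sums with counts (a hypothetical scheme, not an existing Mahout API):

```python
def move_doc(doc, old, new, sums, counts):
    """Shift one document vector from cluster `old` to cluster `new`,
    keeping per-cluster running sums and member counts consistent."""
    for i, v in enumerate(doc):
        sums[old][i] -= v
        sums[new][i] += v
    counts[old] -= 1
    counts[new] += 1

def centroid(cid, sums, counts):
    """Recover a cluster's centroid from its running sum and count."""
    return [s / counts[cid] for s in sums[cid]]

sums = {"a": [2.0, 0.0], "b": [0.0, 3.0]}
counts = {"a": 2, "b": 3}
move_doc([1.0, 0.0], "a", "b", sums, counts)
print(centroid("a", sums, counts), centroid("b", sums, counts))
```

Updating centroids in place like this keeps assignments cheap, at the cost of gradual drift from what a full re-clustering would produce.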
>>>
>>> I would really appreciate any help!
>>>
>>> Thanks,
>>> David
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>