Sure,
Each iteration of the kmeans, fuzzyK & Dirichlet clustering algorithms begins 
with an initial (prior) set of clusters (a.k.a. models). Each iteration assigns 
each input vector to one cluster (kmeans = most likely; Dirichlet = multinomial 
sampling) or to multiple clusters (fuzzyK = a percentage of each). Then, at the 
end of the iteration, each cluster's parameters are recomputed from the 
observed data, and the posterior clusters from iteration n become the prior 
clusters for iteration n+1.
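
To make the loop concrete, here's a toy sketch of one kmeans-style iteration. 
This is not Mahout code, just a minimal illustration of the prior -> assign -> 
recompute -> posterior cycle described above (the class and method names are 
mine, not Mahout's):

```java
// Conceptual sketch (not Mahout code): one iteration of k-means as
// described above. The prior centroids play the role of the cluster
// models; each point is assigned to its most likely (nearest) cluster,
// and the recomputed centroids become the priors for the next iteration.
public class KMeansIteration {

    // Squared Euclidean distance between two points.
    static double dist2(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            d += diff * diff;
        }
        return d;
    }

    // One iteration: assign every point to its nearest prior centroid,
    // then return the recomputed (posterior) centroids.
    static double[][] iterate(double[][] points, double[][] priors) {
        int k = priors.length, dim = priors[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (double[] p : points) {
            int best = 0;
            for (int c = 1; c < k; c++) {
                if (dist2(p, priors[c]) < dist2(p, priors[best])) best = c;
            }
            counts[best]++;
            for (int i = 0; i < dim; i++) sums[best][i] += p[i];
        }
        double[][] posteriors = new double[k][dim];
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < dim; i++) {
                // An empty cluster keeps its prior centroid.
                posteriors[c][i] = counts[c] == 0
                        ? priors[c][i] : sums[c][i] / counts[c];
            }
        }
        return posteriors;
    }
}
```

Running iterate() repeatedly until the centroids stop moving is the whole 
algorithm; starting from good priors (e.g. a previous clustering) just means 
fewer of these iterations are needed.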

Based upon discussions with Ted, I've been trying to recast clustering in terms 
of an unsupervised classification problem. This is most obvious if you look at 
the new ClusterClassifier & ClusterIterator, which implement all three 
algorithms in a single classification-ready engine. ClusterClassifier extends 
AbstractVectorClassifier and implements OnlineLearner. This means a 
ClusterClassifier produced by unsupervised training with some data can be used 
as a model in a semi-supervised classifier along with models obtained via 
supervised training.
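
The unifying idea is easy to see in miniature: every algorithm scores a vector 
against each cluster model, and the three algorithms differ only in the policy 
that turns those scores into assignments. The sketch below is my own 
illustration of that idea, not the actual ClusterClassifier API (all names and 
the inverse-distance scoring are assumptions for the example):

```java
import java.util.Random;

// Conceptual sketch only -- not the real Mahout ClusterClassifier API.
// pdf() scores a vector against every cluster model; the three policies
// below mirror kmeans, Dirichlet, and fuzzyK assignment, respectively.
public class SoftClusterer {
    static final Random RNG = new Random(42);

    // Normalized similarity of a point to each centroid (inverse
    // squared distance, so closer clusters get larger weights).
    static double[] pdf(double[] point, double[][] centroids) {
        double[] p = new double[centroids.length];
        double total = 0;
        for (int c = 0; c < centroids.length; c++) {
            double d = 0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centroids[c][i];
                d += diff * diff;
            }
            p[c] = 1.0 / (1.0 + d);
            total += p[c];
        }
        for (int c = 0; c < p.length; c++) p[c] /= total;
        return p;
    }

    // kmeans policy: hard assignment to the most likely cluster.
    static int hardAssign(double[] pdf) {
        int best = 0;
        for (int c = 1; c < pdf.length; c++) if (pdf[c] > pdf[best]) best = c;
        return best;
    }

    // Dirichlet-style policy: sample a cluster from the multinomial.
    static int sampleAssign(double[] pdf) {
        double r = RNG.nextDouble(), cum = 0;
        for (int c = 0; c < pdf.length; c++) {
            cum += pdf[c];
            if (r < cum) return c;
        }
        return pdf.length - 1;
    }

    // fuzzyK policy: the pdf itself is the fractional assignment.
}
```

Because the pdf-style scoring is all a downstream classifier needs, a model 
trained this way (unsupervised) can sit alongside supervised models, which is 
what makes the semi-supervised combination work.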

I've adjusted the 3 Display clustering examples to use the ClusterClassifier so 
you can see that it works pretty well. I'm particularly pleased with how 
Dirichlet and Kmeans fit together using this approach.

-----Original Message-----
From: Benson Margulies [mailto:[email protected]] 
Sent: Thursday, May 12, 2011 9:14 AM
To: [email protected]
Subject: Re: AW: Incremental clustering

Jeff,

Could you expand a bit on the subject of models in clustering? I
mentally simplify this into 'clustering: unsupervised; classification:
supervised.'

Is the idea here that you are going to be presented with many
different corpora that have some sort of overall resemblance, so that
priors derived from the first N speed up clustering N+1?

--benson


On Thu, May 12, 2011 at 12:00 PM, Jeff Eastman <[email protected]> wrote:
> Sure, by using your old clusters as the prior (clustersIn) for the new 
> clustering, you can reduce the number of iterations required to converge.
>
> -----Original Message-----
> From: David Saile [mailto:[email protected]]
> Sent: Thursday, May 12, 2011 8:54 AM
> To: [email protected]
> Subject: Re: AW: Incremental clustering
>
> Thank you very much everyone! This really helped a lot.
>
> Here is what I am planning to do:
> I am going to compute an initial clustering after the first crawl.
> Then, as sites are being added to the index I will simply classify them using 
> the existing clusters.
>
> As I expect updates to be generally very small, I will only recompute the 
> clustering after some threshold has been hit, like Grant suggested.
> As Ted pointed out, this can be done with the old clusters as input.
>
> Thanks again,
> David
>
>
>
> On May 12, 2011, at 5:35 PM, Ted Dunning wrote:
>
>> Most of these algorithms can be done in an incremental fashion in which you
>> can add batches to the previous training.
>>
>> On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <[email protected]> wrote:
>>
>>> Most of the clustering drivers have two methods: one to train the clusterer
>>> with data to produce the cluster models; one to classify the data using a
>>> given set of cluster models. Currently the CLI only allows train followed by
>>> optional classify. We could pretty easily allow classify to be done
>>> stand-alone, and this would be useful in support of Grant's approach below.
>>>
>>> Jeff
>>>
>>> -----Original Message-----
>>> From: Grant Ingersoll [mailto:[email protected]]
>>> Sent: Thursday, May 12, 2011 3:32 AM
>>> To: [email protected]
>>> Subject: Re: AW: Incremental clustering
>>>
>>> From what I've seen, using Mahout's existing clustering methods, I think
>>> most people set up some schedule whereby they cluster the whole collection on
>>> a regular basis and then all docs that come in the meantime are simply
>>> assigned to the closest cluster until the next whole collection iteration is
>>> completed.  There are, of course, other variants one could do, such as kick
>>> off the whole clustering when some threshold of number of docs is reached.
>>>
>>> There are other clustering methods, as Benson alluded to, that may better
>>> support incremental approaches.
>>>
>>> On May 12, 2011, at 4:53 AM, David Saile wrote:
>>>
>>>> I am still stuck at this problem.
>>>>
>>>> Can anyone give me a heads-up on how existing systems handle this?
>>>> If a collection of documents is modified, is the clustering recomputed
>>>> from scratch each time?
>>>> Or is there in fact any incremental way to handle an evolving set of
>>>> documents?
>>>>
>>>> I would really appreciate any hint!
>>>>
>>>> Thanks,
>>>> David
>>>>
>>>>
>>>> On May 9, 2011, at 12:45 PM, Ulrich Poppendieck wrote:
>>>>
>>>>> Not an answer, but a follow-up question:
>>>>> I would be interested in the very same thing, but with the possibility
>>>>> to assign new sites to existing clusters OR to new ones.
>>>>>
>>>>> Thanks in advance,
>>>>> Ulrich
>>>>>
>>>>> -----Original Message-----
>>>>> From: David Saile [mailto:[email protected]]
>>>>> Sent: Monday, May 9, 2011 11:53 AM
>>>>> To: [email protected]
>>>>> Subject: Incremental clustering
>>>>>
>>>>> Hi list,
>>>>>
>>>>> I am completely new to Mahout, so please forgive me if the answer to my
>>>>> question is too obvious.
>>>>>
>>>>> For a case study, I am working on a simple incremental web crawler (much
>>>>> like Nutch) and I want to include a very simple indexing step that
>>>>> incorporates clustering of documents.
>>>>>
>>>>> I was hoping to use some kind of incremental clustering algorithm, in
>>>>> order to make use of the incremental way the crawler is supposed to work
>>>>> (i.e. continuously adding and updating websites).
>>>>>
>>>>> Is there some way to achieve the following:
>>>>>     1) initial clustering of the first web-crawl
>>>>>     2) assigning new sites to existing clusters
>>>>>     3) possibly moving modified sites between clusters
>>>>>
>>>>> I would really appreciate any help!
>>>>>
>>>>> Thanks,
>>>>> David
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem docs using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>
>
