Most of these algorithms can be run incrementally: you can add new batches
of data on top of the previous training.
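
To make the idea concrete, here is a rough sketch of a batch-wise update in
the style of online k-means. It is plain Python, not Mahout code; the names
nearest and update_with_batch are made up for illustration, and it assumes
dense vectors with Euclidean distance. Each new point is assigned to its
closest existing centroid, and that centroid is then nudged toward the point
using a running mean.

```python
import math


def nearest(centroids, point):
    """Return the index of the centroid closest to point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))


def update_with_batch(centroids, counts, batch):
    """Fold one batch of points into existing centroids, in place.

    centroids: list of coordinate lists from the previous training
    counts:    number of points each centroid has absorbed so far
    batch:     new points to add
    """
    for point in batch:
        i = nearest(centroids, point)
        counts[i] += 1
        # Running-mean step: the centroid moves 1/n of the way to the new point.
        rate = 1.0 / counts[i]
        centroids[i] = [c + rate * (x - c) for c, x in zip(centroids[i], point)]
    return centroids, counts
```

Calling update_with_batch again with the next batch continues from the
current state, so nothing is recomputed from scratch. Calling nearest alone
gives the "assign to the closest cluster" step without changing the models.
This sketch never creates new clusters; that would need an extra rule, such
as a distance threshold.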

On Thu, May 12, 2011 at 8:30 AM, Jeff Eastman <[email protected]> wrote:

> Most of the clustering drivers have two methods: one to train the clusterer
> with data to produce the cluster models; one to classify the data using a
> given set of cluster models. Currently the CLI only allows train followed by
> optional classify. We could pretty easily allow classify to be done
> stand-alone, and this would be useful in support of Grant's approach below.
>
> Jeff
>
> -----Original Message-----
> From: Grant Ingersoll [mailto:[email protected]]
> Sent: Thursday, May 12, 2011 3:32 AM
> To: [email protected]
> Subject: Re: AW: Incremental clustering
>
> From what I've seen, using Mahout's existing clustering methods, I think
> most people set up some schedule whereby they cluster the whole collection
> on a regular basis, and all docs that come in the meantime are simply
> assigned to the closest cluster until the next whole-collection iteration
> is completed. There are, of course, other variants, such as kicking off
> the whole clustering when some threshold number of docs is reached.
>
> There are other clustering methods, as Benson alluded to, that may better
> support incremental approaches.
>
> On May 12, 2011, at 4:53 AM, David Saile wrote:
>
> > I am still stuck at this problem.
> >
> > Can anyone give me a heads-up on how existing systems handle this?
> > If a collection of documents is modified, is the clustering recomputed
> > from scratch each time?
> > Or is there in fact any incremental way to handle an evolving set of
> > documents?
> >
> > I would really appreciate any hint!
> >
> > Thanks,
> > David
> >
> >
> > On 09.05.2011 at 12:45, Ulrich Poppendieck wrote:
> >
> >> Not an answer, but a follow-up question:
> >> I would be interested in the very same thing, but with the possibility
> >> to assign new sites to existing clusters OR to new ones.
> >>
> >> Thanks in advance,
> >> Ulrich
> >>
> >> -----Original Message-----
> >> From: David Saile [mailto:[email protected]]
> >> Sent: Monday, May 9, 2011 11:53
> >> To: [email protected]
> >> Subject: Incremental clustering
> >>
> >> Hi list,
> >>
> >> I am completely new to Mahout, so please forgive me if the answer to my
> >> question is too obvious.
> >>
> >> For a case study, I am working on a simple incremental web crawler (much
> >> like Nutch) and I want to include a very simple indexing step that
> >> incorporates clustering of documents.
> >>
> >> I was hoping to use some kind of incremental clustering algorithm, in
> >> order to make use of the incremental way the crawler is supposed to work
> >> (i.e. continuously adding and updating websites).
> >>
> >> Is there some way to achieve the following:
> >>      1) initial clustering of the first web-crawl
> >>      2) assigning new sites to existing clusters
> >>      3) possibly moving modified sites between clusters
> >>
> >> I would really appreciate any help!
> >>
> >> Thanks,
> >> David
> >
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
