Incremental classification into an existing set of clusters is fairly straightforward and will get easier as we merge the supervised and unsupervised classification interfaces.
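
To make the incremental step concrete, here is a minimal sketch of assigning a new article vector to the nearest existing centroid. It is plain Python rather than Mahout code: the function names and the choice of Euclidean distance are assumptions for illustration, and in practice the centroids would first be read out of the SequenceFile written by the k-means driver.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_to_nearest(vector, centroids):
    """Return (index, distance) of the centroid closest to `vector`.

    Hypothetical helper for illustration; not part of any Mahout API.
    """
    best_index, best_distance = -1, float("inf")
    for i, centroid in enumerate(centroids):
        d = euclidean(vector, centroid)
        if d < best_distance:
            best_index, best_distance = i, d
    return best_index, best_distance


# Two toy centroids standing in for clusters from a previous batch run.
centroids = [[0.0, 0.0], [10.0, 10.0]]
index, distance = assign_to_nearest([9.0, 10.0], centroids)
```

The cost is one distance computation per centroid for each new article, so nothing has to be recomputed for the vectors that are already clustered.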
You can also do online updates of the cluster centroids (not supported yet, but you can cannibalize existing code). Changing the number of clusters online can be daunting, but it is reasonable to look at the distribution of the distance to the nearest cluster for each new article. When that distribution shows a significant disturbance, you can trigger another batch job with the current clusters as seeds.

Making this all work on real data will take some effort, but it doesn't require any major leaps of new knowledge.

On Tue, Nov 23, 2010 at 9:58 AM, Gustavo Fernandes <[email protected]> wrote:

> Hello, we have a mission to implement a system to cluster news articles in
> near real time mode. We have a large amount of articles (millions), and we
> started using k-means to create clusters based on a fixed value of "k". The
> problem is that we have a constant incoming flow of news articles and we
> can't afford to rely on a batch process, we need to be able to present users
> clustered articles as soon as they arrive in our database. So far our
> clusters are saved into a SequenceFile, as normally output by k-means
> driver.
> What would be the recommended way of approaching this problem with Mahout?
> Is it possible to manipulate the generated clusters and incrementally add
> new articles to them, or even forming new clusters without incurring the
> penalty of recalculating for every vector again? Is starting with k-means
> the right way? What would be the right combination of algorithms to provide
> incremental and fast clustering calculation?
>
> TIA,
> Gustavo
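
Coming back to the online centroid update and the distance monitoring described in the reply above, the following is a rough sketch of how the two could fit together. It is plain Python, not Mahout code. The class name, the running-mean update rule, the sliding window, and the "baseline mean plus a few standard deviations" trigger are all assumptions chosen for illustration; a real system would likely want a more careful test of the distance distribution.

```python
import math
import statistics
from collections import deque


class OnlineClusters:
    """Hypothetical helper; not part of Mahout."""

    def __init__(self, centroids, counts, baseline_distances,
                 window=50, threshold=2.0):
        self.centroids = [list(c) for c in centroids]
        self.counts = list(counts)  # points already behind each centroid
        # Nearest-centroid distances observed during the last batch run.
        self.baseline_mean = statistics.mean(baseline_distances)
        self.baseline_std = statistics.pstdev(baseline_distances)
        self.recent = deque(maxlen=window)  # distances for recent articles
        self.threshold = threshold

    def add(self, vector):
        """Assign `vector` to its nearest centroid and update that centroid."""
        distances = [math.dist(vector, c) for c in self.centroids]
        index = min(range(len(distances)), key=distances.__getitem__)
        self.recent.append(distances[index])
        # Running-mean update: move the centroid by 1/n toward the new point.
        self.counts[index] += 1
        rate = 1.0 / self.counts[index]
        centroid = self.centroids[index]
        for j, x in enumerate(vector):
            centroid[j] += rate * (x - centroid[j])
        return index

    def needs_rebatch(self):
        """True when recent distances sit well above the batch baseline."""
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough recent articles to judge yet
        limit = self.baseline_mean + self.threshold * self.baseline_std
        return statistics.mean(self.recent) > limit
```

When `needs_rebatch()` returns true, the current `centroids` would be handed to the next batch k-means job as seeds. Note that the running-mean update also pulls centroids toward outlying articles, so in this sketch the drift check is the only guard against clusters slowly wandering.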
