(Near) Realtime clustering

Gustavo Fernandes Tue, 23 Nov 2010 09:59:00 -0800

Hello, we need to implement a system that clusters news articles in near real 
time. We have a large number of articles (millions), and we started by using 
k-means to create clusters with a fixed value of "k". The problem is that 
articles arrive in a constant stream, so we can't afford to rely on a batch 
process: we need to present clustered articles to users as soon as they arrive 
in our database. So far our clusters are saved in a SequenceFile, as normally 
output by the k-means driver. 
What would be the recommended way of approaching this problem with Mahout? Is 
it possible to manipulate the generated clusters and incrementally add new 
articles to them, or even form new clusters, without paying the cost of 
recalculating over every vector again? Is starting with k-means the right 
approach? What combination of algorithms would provide fast, incremental 
clustering?
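
To make the question concrete, here is a rough sketch of the incremental 
behaviour we have in mind. It is plain Java and does not use any Mahout API: 
the class name IncrementalClusterer, the distance threshold, and the choice of 
Euclidean distance are all assumptions made up for illustration. A new vector 
joins the nearest existing centroid and shifts it with a running mean, or opens 
a new cluster if nothing is close enough.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of incremental cluster assignment; not part of Mahout. */
public class IncrementalClusterer {

    /** A running centroid plus the number of vectors folded into it. */
    private static final class Cluster {
        final double[] centroid;
        long count;

        Cluster(double[] centroid, long count) {
            this.centroid = centroid.clone();
            this.count = count;
        }
    }

    private final List<Cluster> clusters = new ArrayList<>();
    // Assumed cutoff: a vector farther than this from every centroid starts a new cluster.
    private final double threshold;

    public IncrementalClusterer(double threshold) {
        this.threshold = threshold;
    }

    /** Seeds a cluster, e.g. from a centroid produced by an earlier batch run. */
    public void seed(double[] centroid, long count) {
        clusters.add(new Cluster(centroid, count));
    }

    /** Assigns one vector and returns the index of the cluster it landed in. */
    public int add(double[] v) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < clusters.size(); i++) {
            double d = euclidean(clusters.get(i).centroid, v);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        if (best < 0 || bestDist > threshold) {
            // Nothing close enough: open a new cluster around this vector.
            clusters.add(new Cluster(v, 1));
            return clusters.size() - 1;
        }
        // Running-mean update: only the winning centroid is touched.
        Cluster c = clusters.get(best);
        c.count++;
        for (int j = 0; j < v.length; j++) {
            c.centroid[j] += (v[j] - c.centroid[j]) / c.count;
        }
        return best;
    }

    public int size() {
        return clusters.size();
    }

    public double[] centroid(int i) {
        return clusters.get(i).centroid.clone();
    }

    private static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int j = 0; j < a.length; j++) {
            double diff = a[j] - b[j];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }
}
```

Each call to add() costs one distance computation per existing cluster, so no 
previously seen vector is revisited. Whether something along these lines can be 
done on top of the clusters that the k-means driver writes out is exactly what 
we are unsure about.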


TIA,
Gustavo
