Hi all,

I want to know if it is possible to identify news topics for a given set of
articles by clustering them with Mahout.

What I have is:
A set of news articles from different sources with a time-stamp and a weighted 
vector of news categories for each article.

What I want is:
Clusters of articles from different sources that deal with the same topic.

I basically want to copy the key feature of Google News: presenting topics and
listing the different news sources that cover each topic.

I guess I need a clustering algorithm that can:
1. Build clusters without knowing in advance how many clusters it needs to
build.
2. Take a single new item as input and decide whether it belongs to an existing
cluster or represents a new cluster.
3. Take into account the timestamp of an item and the source of an item (two
articles from the same source probably cover different topics).
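
To make these three requirements concrete, here is a rough Python sketch of the
behaviour I have in mind. It is a simple single-pass (leader-style) clustering,
not Mahout code and not a recommendation of a specific algorithm. The names
Article, Cluster, score and assign, and the values for threshold, half_life and
source_penalty, are all assumptions made up for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Article:
    source: str
    timestamp: float   # seconds since epoch
    categories: dict   # category name -> weight


@dataclass
class Cluster:
    centroid: dict     # running mean of member category vectors
    last_seen: float   # timestamp of the newest member
    sources: set = field(default_factory=set)
    members: list = field(default_factory=list)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def score(article, cluster, half_life=86400.0, source_penalty=0.5):
    """Category similarity, decayed by time gap and penalised for a repeated source.

    half_life and source_penalty are illustrative values, not tuned ones.
    """
    sim = cosine(article.categories, cluster.centroid)
    age = abs(article.timestamp - cluster.last_seen)
    sim *= 0.5 ** (age / half_life)          # requirement 3: timestamp
    if article.source in cluster.sources:    # requirement 3: source
        sim *= source_penalty
    return sim


def assign(article, clusters, threshold=0.6):
    """Put the article into the best-scoring cluster, or open a new one."""
    best = max(clusters, key=lambda c: score(article, c), default=None)
    if best is None or score(article, best) < threshold:
        # Requirements 1 and 2: no fixed cluster count, new clusters on demand.
        best = Cluster(centroid=dict(article.categories),
                       last_seen=article.timestamp)
        clusters.append(best)
    else:
        # Update the centroid as a running mean over all members.
        n = len(best.members)
        keys = set(best.centroid) | set(article.categories)
        best.centroid = {
            k: (best.centroid.get(k, 0.0) * n
                + article.categories.get(k, 0.0)) / (n + 1)
            for k in keys
        }
        best.last_seen = max(best.last_seen, article.timestamp)
    best.sources.add(article.source)
    best.members.append(article)
    return best
```

Calling assign once per incoming article, with one shared clusters list, would
give the incremental behaviour from point 2. The result depends on the order in
which articles arrive, which is one of the things I am unsure about.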

I guess I'm not the first to run into this, and there should be people who
have tackled this problem already and can share some experience.

What I want to know is:
1. Is Mahout the right starting point to solve this problem, or should I be
looking at something else?
2. Has anyone on this mailing list dealt with this or a similar problem already
and can share some experience?

Thanks for reading,

Samy
                                          
