Hi all,
I would like to know whether it is possible to identify news topics for a given
set of articles by clustering them with Mahout.
What I have is:
A set of news articles from different sources with a time-stamp and a weighted
vector of news categories for each article.
What I want is:
Clusters of articles from different sources that deal with the same topic.
I basically want to copy the key feature of Google News: presenting topics and
listing different news sources for the same topic.
I guess I need a clustering algorithm that can:
1. Build clusters without knowing in advance how many clusters are needed.
2. Take a single new item as input and decide whether it belongs to an existing
cluster or represents a new cluster.
3. Take into account the timestamp and the source of each item (two items from
the same source are probably about different topics).
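
To make the three requirements above concrete, here is a minimal sketch of a
single-pass ("leader") clustering loop in plain Python. It does not use Mahout
or any Mahout API. All names (Article, Cluster, score, assign) and all
parameter values (the 0.6 threshold, the one-day half-life, the 0.5 same-source
penalty) are assumptions chosen for illustration, not tuned values.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Article:
    source: str
    timestamp: float   # seconds since epoch
    weights: dict      # news category -> weight


@dataclass
class Cluster:
    centroid: dict
    last_seen: float
    sources: set = field(default_factory=set)
    members: list = field(default_factory=list)

    def add(self, article):
        n = len(self.members)
        # Update the centroid as the running mean of all member vectors.
        for key in set(self.centroid) | set(article.weights):
            old = self.centroid.get(key, 0.0)
            self.centroid[key] = (old * n + article.weights.get(key, 0.0)) / (n + 1)
        self.members.append(article)
        self.sources.add(article.source)
        self.last_seen = max(self.last_seen, article.timestamp)


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score(article, cluster, half_life=86400.0, same_source_penalty=0.5):
    """Similarity, decayed by time gap and reduced for a repeated source."""
    s = cosine(article.weights, cluster.centroid)
    age = abs(article.timestamp - cluster.last_seen)
    s *= 0.5 ** (age / half_life)          # requirement 3: timestamp
    if article.source in cluster.sources:  # requirement 3: source
        s *= same_source_penalty
    return s


def assign(article, clusters, threshold=0.6):
    """Requirement 2: join the best existing cluster or open a new one."""
    best = max(clusters, key=lambda c: score(article, c), default=None)
    if best is None or score(article, best) < threshold:
        # Requirement 1: the number of clusters grows as needed.
        best = Cluster(centroid={}, last_seen=article.timestamp)
        clusters.append(best)
    best.add(article)
    return best
```

Calling assign(article, clusters) once per incoming article, starting from an
empty list, yields the clusters incrementally. A single pass like this depends
on the order of arrival and on the threshold, so it is only a starting point
for discussion.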
I guess I'm not reinventing the wheel here, and there should be people who have
tackled this problem already and can share some experience.
What I want to know is:
1. Is Mahout the right starting point to solve this problem, or should I be
looking at something else?
2. Has anyone on this mailing list dealt with this or a similar problem already
and can share some experience?
Thanks for reading,
Samy