Hi Young,
You did not mention what part(s) of Mahout you are using, but I will
assume the clustering code. LDA is designed to deduce a set of topics
from a corpus of documents and does not require or allow the topics to
be predefined. Some of the other clustering algorithms (e.g. k-Means,
Fuzzy k-Means, Dirichlet) can be initialized with a set of topics
(clusters), but after the iterations these will likely have changed
significantly. K-Means can also be initialized by running Canopy over
your dataset, but none of the Mahout clustering algorithms requires
hard-coded topics. Once you have developed a set of topics (generally an
offline, batch process) you can use one of the clustering
implementations to quickly cluster new documents using those topics.
Of course, if you really want to use predefined topics then you should
look at some of the classification algorithms which can be trained to
sort your news articles on the fly.
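
To illustrate the step of clustering new documents against topics that
were already developed offline, here is a minimal sketch in plain Python.
It is not Mahout code and does not use any Mahout API: the centroid
term weights, the topic names, and the helper functions (term_vector,
cosine_similarity, assign_topic) are all hypothetical, and a real
pipeline would use proper tokenization and TF-IDF weighting.

```python
import math
from collections import Counter


def term_vector(text):
    """Build a lowercased bag-of-words term-count vector (naive whitespace split)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_topic(document, centroids):
    """Return the name of the centroid most similar to the document."""
    vec = term_vector(document)
    return max(centroids, key=lambda name: cosine_similarity(vec, centroids[name]))


# Hypothetical centroids, standing in for the output of an offline clustering run.
centroids = {
    "sports": Counter({"match": 3, "team": 2, "goal": 2}),
    "finance": Counter({"market": 3, "stocks": 2, "bank": 2}),
}

print(assign_topic("The team scored a late goal", centroids))
print(assign_topic("Stocks fell as the market opened", centroids))
```

The centroids stay fixed here, so each new article costs only one
similarity computation per topic, which is why this step can run on the
fly while the topic discovery itself stays a batch job.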
Jeff
On 8/25/10 9:32 AM, Young wrote:
Hi all,
I am using Mahout to cluster news articles, and I can see the top words for
each cluster. But I am very keen to know how to define a topic for each
cluster. Do we have to hard-code the topic for each cluster?
I found an interesting site, http://search.carrot2.org/stable/search, which does
excellent topic clustering based on the page content.
Thank you very much.
--Young