Hi Young,

You did not mention which part(s) of Mahout you are using, but I will assume the clustering code. LDA is designed to infer a set of topics from a corpus of documents; it neither requires nor allows the topics to be predefined. Some of the other clustering algorithms (e.g. k-Means, Fuzzy k-Means, Dirichlet) can be initialized with a set of topics (clusters), but these will likely have changed significantly by the time the iterations finish. K-Means can also be initialized by running Canopy over your dataset. Either way, none of the Mahout clustering algorithms requires hard-coding. Once you have developed a set of topics (generally an offline, batch process), you can use one of the clustering implementations to quickly cluster new documents using those topics.
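
To make that last step concrete, here is a minimal sketch in plain Python of assigning a new document to the closest of a fixed set of topic centroids. It is not Mahout code and does not use Mahout's API. The centroid values, the whitespace tokenizer, the choice of cosine similarity, and the names vectorize, cosine_similarity and assign_topic are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a document into a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_topic(text, centroids):
    """Return the name of the centroid most similar to the document."""
    doc = vectorize(text)
    return max(centroids, key=lambda name: cosine_similarity(doc, centroids[name]))


# Hypothetical centroids, standing in for the output of an offline
# clustering run over the news corpus.
centroids = {
    "sports": Counter({"match": 3, "team": 2, "goal": 2}),
    "finance": Counter({"market": 3, "stocks": 2, "bank": 2}),
}

print(assign_topic("The team scored a late goal", centroids))
```

The centroids are computed once and then held fixed, so each new article costs only one similarity computation per topic.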

Of course, if you really want to use predefined topics, then you should look at some of the classification algorithms, which can be trained to sort your news articles on the fly.
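
As a rough illustration of the classification route, the sketch below trains a tiny multinomial naive Bayes classifier on a handful of labeled headlines and then labels a new one. Again, this is plain Python rather than Mahout's API. The training sentences, the labels, and the function names train and classify are made up for the example, and naive Bayes is just one possible choice of classifier.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per label from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior of the label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothed word likelihood.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data with predefined topics.
model = train([
    ("the team won the match", "sports"),
    ("a late goal decided the game", "sports"),
    ("stocks fell as the market closed", "finance"),
    ("the bank raised interest rates", "finance"),
])

print(classify("the market and the bank", model))
```

Unlike the clustering approach, the topic names here are fixed in advance by the training labels.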

Jeff


On 8/25/10 9:32 AM, Young wrote:
Hi all,
I am using Mahout to cluster the news and I can see the top words for 
each cluster. But I am very keen to know how to define a topic for each 
cluster. Do we have to hard-code the topic for the cluster?

I found an interesting site, http://search.carrot2.org/stable/search, and they do 
excellent topic clustering based on the page content.

Thank you very much.

--Young
