Hi Salman,

> I have got news documents (around 3000 and continuously increasing)
> containing news about companies, investment, stocks, economy, quarterly
> income etc. My goal is to have the news sorted in such a way that I know
> which news correspond to which company. e.g. for the news item "Apple
> launches new iphone", I need to associate the company Apple with it. A
> particular news item/document only contains 'title' and 'description' so I
> have to analyze the text in order to find out which company the news
> refers to. It could be multiple companies too.
>
If this is the problem you are trying to solve, I would suggest a different
solution. Since you want to classify by company only, it is better to use a
NER (named-entity recognition) system to identify the company names in each
document and then use those names to map the articles to companies. This
would be a simple and effective solution.

> You can see how confused I am about how to address this issue. Another
> thing that concerns me is that if it's possible to make a system this
> intelligent, that if the news says 'iphone sales at a record high' without
> using the word 'Apple', the system can classify it as a news related to
> apple?

This is hard to achieve with classification. You may need to spend a lot of
time creating the training set, and even then the chance of such a system
working well is low. But if you go with a NER-based solution, you can
customize the NER to identify entities such as "iPhone" and then map them to
Apple. This is achievable at low risk.

Just a thought. I would not recommend Mahout for such a problem.

--
Biju
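
P.S. To make the mapping idea concrete, here is a minimal sketch in Python. It is not a real NER system: a plain dictionary lookup stands in for the entity recognizer, and the alias table, the company list, and the function name `companies_for` are all made up for illustration. In practice the aliases would come from a NER model's output plus a curated list of products and brands.

```python
import re

# Hypothetical alias table: surface form (lowercase) -> company.
# Assumption: a real system would populate this from NER output
# and a maintained product/brand list.
ALIASES = {
    "apple": "Apple",
    "iphone": "Apple",
    "ipad": "Apple",
    "google": "Google",
    "android": "Google",
    "microsoft": "Microsoft",
    "windows phone": "Microsoft",
}

# Try longer aliases first so multi-word names win over shorter ones.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ALIASES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def companies_for(title, description=""):
    """Return the sorted list of companies mentioned in a news item."""
    text = f"{title} {description}"
    found = {ALIASES[m.group(1).lower()] for m in _PATTERN.finditer(text)}
    return sorted(found)


print(companies_for("iPhone sales at a record high"))  # ['Apple']
print(companies_for("Google and Microsoft report quarterly income"))
```

A lookup like this cannot tell the company from the fruit ("apple prices rise"), which is where a trained NER model earns its keep. The product-to-company table is still needed either way.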
