Hi Sean,

Thank you for the clarification. Are you saying that Mahout is not suitable in this case, or that clustering is not the right way to go and that, if it's worth the effort, I should go for classification?
Secondly, are you the same Sean Owen who wrote Mahout in Action? :)

On Wed, Aug 1, 2012 at 7:39 PM, Sean Owen <[email protected]> wrote:

> Classifiers are supervised learning algorithms, so you need to provide
> a bunch of examples of positive and negative classes. In your example,
> it would be fine to label a bunch of articles as "about Apple" or not,
> then use feature vectors derived from TF-IDF as input, with these
> labels, to train a classifier that can tell when an article is "about
> Apple".
>
> I don't think it will quite work to automatically generate the
> training set by labeling according to the simple rule, that it is
> about Apple if 'Apple' is in the title. Well, if you do that, then
> there is no point in training a classifier. You can make a trivial
> classifier that achieves 100% accuracy on your test set by just
> checking if 'Apple' is in the title! Yes, you are right, this gains
> you nothing.
>
> Clearly you want to learn something subtler from the classifier, so
> that an article titled "Apple juice shown to reduce risk of dementia"
> isn't classified as about the company. You'd really need to feed it
> hand-classified documents.
>
> That's the bad news, but, sure you can certainly train N classifiers
> for N topics this way.
>
> Classifiers put items into a class or not. They are not the same as
> regression techniques which predict a continuous value for an input.
> They're related but distinct.
>
>
> Clustering has the advantage of being unsupervised. You don't need
> labels. However the resulting clusters are not guaranteed to match up
> to your notion of article topics. You may see a cluster that has a lot
> of Apple articles, some about the iPod, but also some about Samsung
> and laptops in general. I don't think this is the best tool for your
> problem.
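To make the supervised setup described above concrete, here is a minimal sketch in plain Python rather than Mahout. Everything in it is an assumption made for illustration: the training headlines and their labels are invented, the function names (`tokenize`, `fit_idf`, `tfidf`, `train`, `predict`) are hypothetical, and a simple nearest-centroid rule stands in for whatever classifier one would actually train. It only shows the shape of the idea: hand-labeled examples go in, TF-IDF vectors are built from them, and the resulting model can say "about Apple" or not for text it has not seen.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs):
    """Smoothed inverse document frequency over tokenized training docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens, idf):
    """L2-normalized TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    vec = {t: count * idf[t] for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def centroid(vectors):
    """Component-wise mean of sparse vectors (assumes a non-empty list)."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def dot(a, b):
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def train(labeled):
    """labeled: (text, is_about_company) pairs, labeled by hand.

    Assumes both classes are present in the training data.
    """
    docs = [tokenize(text) for text, _ in labeled]
    idf = fit_idf(docs)
    vecs = [tfidf(doc, idf) for doc in docs]
    pos = centroid([v for v, (_, y) in zip(vecs, labeled) if y])
    neg = centroid([v for v, (_, y) in zip(vecs, labeled) if not y])
    return idf, pos, neg


def predict(model, text):
    """True if the text is closer to the positive centroid than the negative one."""
    idf, pos, neg = model
    vec = tfidf(tokenize(text), idf)
    return dot(vec, pos) > dot(vec, neg)


# Invented, hand-labeled examples: True means "about Apple the company".
TRAINING = [
    ("Apple launches new iphone", True),
    ("iphone sales at a record high", True),
    ("Apple reports quarterly income from ipad and iphone", True),
    ("Apple shares rise after ipod launch", True),
    ("Apple juice shown to reduce risk of dementia", False),
    ("Apple harvest and juice prices fall", False),
    ("Samsung launches new laptop", False),
    ("Green energy investment grows", False),
]

apple_model = train(TRAINING)
print(predict(apple_model, "iphone sales climb"))
print(predict(apple_model, "Apple juice recipe"))
```

Training one such model per company, each with its own hand-labeled positives and negatives, would correspond to the "N classifiers for N topics" idea. With only eight toy examples this proves nothing about real accuracy; a usable training set would need many more labeled documents.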
>
>
>
>
> On Wed, Aug 1, 2012 at 6:08 PM, Salman Mahmood <[email protected]>
> wrote:
> > Hi all,
> >
> > I am stuck deciding whether to apply classification or clustering to
> > the data set I have. The more I think about it, the more confused I
> > get. Here's what I am confronted with.
> >
> > I have news documents (around 3000 and continuously increasing)
> > containing news about companies, investment, stocks, the economy,
> > quarterly income, etc. My goal is to have the news sorted in such a
> > way that I know which news item corresponds to which company. E.g.,
> > for the news item "Apple launches new iphone", I need to associate
> > the company Apple with it. A particular news item/document only
> > contains a 'title' and a 'description', so I have to analyze the text
> > in order to find out which company the news refers to. It could be
> > multiple companies too.
> >
> > To solve this, I turned to Mahout.
> >
> > I started with clustering. I was hoping to get 'Apple', 'Google',
> > 'Intel', etc. as top terms in my clusters, and from there I would
> > know that the news in a cluster corresponds to its cluster label, but
> > things were a bit different. I got 'investment', 'stocks',
> > 'correspondence', 'green energy', 'terminal', 'shares', 'street',
> > 'olympics' and lots of other terms as the top ones (which makes
> > sense, as clustering algorithms look for common terms). Although
> > there were some 'Apple' clusters, the news items associated with them
> > were very few. I thought maybe clustering is not for this kind of
> > problem, as much of the company news goes into more general clusters
> > (investment, profit) instead of the specific company cluster (Apple).
> >
> > I started reading about classification, which requires training data.
> > The name was convincing too, as I actually want to 'classify' my news
> > items into 'company names'. As I read on, I got the impression that
> > the name classification is a bit deceiving and that the technique is
> > used more for prediction than for classification. The other confusion
> > I had was: how can I prepare training data for news documents? Let's
> > assume I have a list of companies that I am interested in. I write a
> > program to produce training data for the classifier. The program
> > checks whether the news title or description contains the company
> > name 'Apple'; if so, it's a news story about Apple. Is this how I can
> > prepare training data? (Of course, I read that training data is
> > actually a set of predictors and target variables.) If so, then why
> > should I use Mahout classification in the first place? I should ditch
> > Mahout and instead use this little program that I wrote for training
> > data (which actually does the classification).
> >
> > You can see how confused I am about how to address this issue.
> > Another thing that concerns me is whether it's possible to make a
> > system so intelligent that, if the news says 'iphone sales at a
> > record high' without using the word 'Apple', the system can classify
> > it as news related to Apple?
> >
> > Thank you in advance for pointing me in the right direction.
>
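For reference, the "little program" from the original question could look roughly like the sketch below. It is only an illustration under stated assumptions: the item layout (a dict with 'title' and 'description'), the function names `mentions` and `companies_for`, and the fourth headline are invented here. The first three headlines are the ones quoted in the thread. The sketch shows why the rule gains nothing as a source of training labels: scored against its own labels it is always 100% accurate, yet it tags the juice story as Apple news and misses the iphone story that never names the company.

```python
import re


def mentions(company, text):
    """Keyword rule: the company name appears as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(company) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def companies_for(item, companies):
    """Return every listed company whose name occurs in the title or description."""
    text = item["title"] + " " + item["description"]
    return [c for c in companies if mentions(c, text)]


COMPANIES = ["Apple", "Google", "Intel"]

ITEMS = [
    {"title": "Apple launches new iphone", "description": ""},
    {"title": "Apple juice shown to reduce risk of dementia", "description": ""},
    {"title": "iphone sales at a record high", "description": ""},
    # Invented headline, only to show that one item can match several companies.
    {"title": "Google and Intel announce partnership", "description": ""},
]

for item in ITEMS:
    print(item["title"], "->", companies_for(item, COMPANIES))

# Using the rule to create the labels and then scoring the same rule against
# them is circular: the agreement is perfect by construction.
rule_labels = [bool(companies_for(item, ["Apple"])) for item in ITEMS]
predictions = [bool(companies_for(item, ["Apple"])) for item in ITEMS]
accuracy = sum(p == y for p, y in zip(predictions, rule_labels)) / len(ITEMS)
print("accuracy against rule-made labels:", accuracy)
```

The second item is a false positive and the third a false negative for the company Apple. Those are the cases where hand-classified documents would be needed.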
