Sorry, I had not seen Sean Owen's post, as the thread had not updated on
my mobile.

Syed Abdul Kather
Sent from Samsung S3

On Aug 1, 2012 11:32 PM, "syed kather" <[email protected]> wrote:
> Hi Salman Mahmood,
>
> Why don't you try applying clustering first? Once you have done a
> high-level clustering, check the top terms of each cluster. Set aside
> the clusters that look good to you and concentrate on the ones that
> look confused, until you are satisfied that all the clusters are fine.
>
> To improve the clusters, I indexed all the documents into Solr, because
> with Solr I could remove stop words and apply filters such as the
> Snowball filter. After clustering again, verify that the top terms of
> each cluster are good. Then use the cluster points to split the
> documents according to their cluster. You will now have the documents
> in groups, and you can apply classification and train on that set.
>
> I hope you understood. This is the approach I followed. Let me know if
> you have other ideas.
>
> Syed Abdul Kather
> Sent from Samsung S3
>
> On Aug 1, 2012 10:38 PM, "Salman Mahmood" <[email protected]> wrote:
>
>> Hi all,
>>
>> I am stuck on the decision of whether to apply classification or
>> clustering to my data set. The more I think about it, the more
>> confused I get. Here's what I am confronted with.
>>
>> I have news documents (around 3000 and continuously increasing)
>> containing news about companies, investment, stocks, the economy,
>> quarterly income etc. My goal is to have the news sorted in such a way
>> that I know which news item corresponds to which company. E.g. for the
>> news item "Apple launches new iphone", I need to associate the company
>> Apple with it. A news item/document only contains a 'title' and a
>> 'description', so I have to analyze the text in order to find out
>> which company the news refers to. It could be multiple companies too.
>>
>> To solve this, I turned to Mahout.
>>
>> I started with clustering.
>> I was hoping to get 'Apple', 'Google', 'Intel' etc. as top terms in
>> my clusters, and from there I would know that the news in a cluster
>> corresponds to its cluster label, but things turned out a bit
>> differently. I got 'investment', 'stocks', 'correspondence',
>> 'green energy', 'terminal', 'shares', 'street', 'olympics' and lots of
>> other terms as the top ones (which makes sense, as clustering
>> algorithms look for common terms). There were some 'Apple' clusters,
>> but very few news items were associated with them. I thought maybe
>> clustering is not for this kind of problem, as much of the company
>> news goes into more general clusters (investment, profit) instead of
>> the specific company cluster (Apple).
>>
>> I started reading about classification, which requires training data.
>> The name was convincing too, as I actually want to 'classify' my news
>> items into 'company names'. As I read on, I got the impression that
>> the name classification is a bit deceiving and that the technique is
>> used more for prediction than for classification. My other confusion
>> was: how can I prepare training data for news documents? Let's assume
>> I have a list of companies that I am interested in. I write a program
>> to produce training data for the classifier. The program checks
>> whether the news title or description contains the company name
>> 'Apple', and if so, it's a news story about Apple. Is this how I can
>> prepare training data? (Of course, I read that training data is
>> actually a set of predictors and target variables.) If so, then why
>> should I use Mahout classification in the first place? I should ditch
>> Mahout and instead use this little program that I wrote for the
>> training data (which actually does the classification).
>>
>> You can see how confused I am about how to address this issue.
>> Another thing that concerns me is whether it is possible to make a
>> system so intelligent that, if the news says 'iphone sales at a record
>> high' without using the word 'Apple', the system can still classify it
>> as news related to Apple.
>>
>> Thank you in advance for pointing me in the right direction.
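
The keyword-labeling question in the quoted message can be sketched in code. This is only a minimal illustration in plain Python, not Mahout code. A keyword matcher produces (noisy) training labels, and a small hand-written multinomial Naive Bayes classifier trained on them can then tag a headline that never mentions the company name. The company list, the training headlines, and the function names (`tokenize`, `weak_label`, `train`, `predict`) are all assumptions invented for this example. `predict` returns a single company, which simplifies the multi-company case mentioned above.

```python
import math
from collections import Counter, defaultdict

# Hypothetical company list and toy headlines, invented for this sketch.
COMPANIES = ["apple", "google"]
TRAIN = [
    "Apple launches new iphone",
    "Apple reports record iphone and ipad sales",
    "Google updates search engine ranking",
    "Google android search ads revenue grows",
]


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(".,!?'\"").lower() for w in text.split())
    return [t for t in tokens if t]


def weak_label(text):
    """Keyword labeling: companies whose name appears literally in the text."""
    tokens = set(tokenize(text))
    return [c for c in COMPANIES if c in tokens]


def train(docs):
    """Fit multinomial Naive Bayes counts on keyword-labeled documents.

    Company names are dropped from the features, so the model has to rely
    on co-occurring words such as 'iphone' or 'search'.
    """
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of documents
    vocab = set()
    for text in docs:
        features = [t for t in tokenize(text) if t not in COMPANIES]
        for label in weak_label(text):
            doc_counts[label] += 1
            word_counts[label].update(features)
            vocab.update(features)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the most probable company using add-one (Laplace) smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                count = word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
headline = "iphone sales at a record high"
print(weak_label(headline))      # the keyword matcher finds no company
print(predict(model, headline))  # the trained model still picks one
```

On this toy data the keyword matcher returns nothing for the headline, while the classifier assigns it to Apple because 'iphone', 'sales' and 'record' co-occurred with Apple in the training headlines. The keyword program is therefore useful for producing training data, and the classifier is what generalizes beyond it. With real news the keyword labels will be noisy and would need spot-checking.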
