Sorry, I had not seen Sean Owen's post, as it had not been updated on my mobile.

Syed Abdul kather
Sent from Samsung S3
On Aug 1, 2012 11:32 PM, "syed kather" <[email protected]> wrote:

> Hi Salman Mahmood,
>     Why don't you try applying clustering first? Once you have applied
> high-level clustering, check the top terms. Set aside the clusters that
> look good to you and look within the clusters that seem confused, until
> you find that all the clusters are fine. To make the clusters cleaner, I
> indexed all the documents into Solr, because with Solr I could remove stop
> words and apply filters such as the Snowball filter. At that point you
> have identified all the clusters. Now verify that each cluster's top terms
> are good. Then use the cluster points to split the documents according to
> their cluster. You will then have the documents in groups, and you can
> apply classification by using those groups as the training set.
>
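The grouping step described above (split the documents by their assigned cluster, check each cluster's top terms, then reuse the groups as a labeled training set) can be sketched in plain Python. This is only an illustration under stated assumptions: the `assignments` mapping stands in for the output of a clustering run, the stop-word set stands in for the Solr analysis chain, and all function names are hypothetical rather than Mahout or Solr APIs.

```python
from collections import Counter, defaultdict

# Hypothetical stop-word list; the post relied on Solr for stop-word removal.
STOP_WORDS = {"the", "a", "of", "to", "in", "and", "at", "for", "new"}

def tokenize(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    tokens = (t.strip(".,!?'\"").lower() for t in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]

def group_by_cluster(docs, assignments):
    """Split documents into groups keyed by their assigned cluster id."""
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        groups[assignments[doc_id]].append(text)
    return dict(groups)

def top_terms(texts, n=3):
    """Return the n most frequent terms across a group of documents."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return [term for term, _ in counts.most_common(n)]

docs = {
    1: "Apple launches new iPhone",
    2: "iPhone sales at a record high for Apple",
    3: "Intel shares fall",
    4: "Intel profit beats forecast, shares rise",
}
# Assumed output of a clustering run: document id -> cluster id.
assignments = {1: 0, 2: 0, 3: 1, 4: 1}

groups = group_by_cluster(docs, assignments)

# Inspect the top terms to decide whether each cluster is usable.
for cluster_id, texts in groups.items():
    print(cluster_id, top_terms(texts, 2))

# Clusters that pass inspection become (text, label) training examples.
training_set = [
    (text, cluster_id)
    for cluster_id, texts in groups.items()
    for text in texts
]
```

In practice the cluster id would be replaced by a human-chosen label once the top terms have been checked by eye.
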
> I hope you understood. This is the approach I followed. Let me know
> if you have any ideas.
> Syed Abdul kather
> Sent from Samsung S3
> On Aug 1, 2012 10:38 PM, "Salman Mahmood" <[email protected]> wrote:
>
>> Hi all,
>>
>> I am stuck on whether to apply classification or clustering to the
>> data set I have. The more I think about it, the more confused I get. Here's
>> what I am confronted with.
>>
>> I have news documents (around 3000 and continuously increasing)
>> containing news about companies, investment, stocks, the economy, quarterly
>> income, etc. My goal is to have the news sorted in such a way that I know
>> which news item corresponds to which company. E.g., for the news item "Apple
>> launches new iPhone", I need to associate the company Apple with it. A
>> particular news item/document contains only a 'title' and a 'description',
>> so I have to analyze the text to find out which company the news
>> refers to. It could be multiple companies too.
>>
>> To solve this, I turned to Mahout.
>>
>> I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel',
>> etc. as top terms in my clusters, and from there I would know that the news
>> in a cluster corresponds to its cluster label, but things were a bit
>> different. I got 'investment', 'stocks', 'correspondence', 'green energy',
>> 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the
>> top ones (which makes sense, as clustering algorithms look for common
>> terms). Although there were some 'Apple' clusters, the news items
>> associated with them were very few. I thought maybe clustering is not
>> suited to this kind of problem, as much of the company news goes into more
>> general clusters (investment, profit) instead of the specific company
>> cluster (Apple).
>>
>> I started reading about classification, which requires training data. The
>> name was convincing too, as I actually want to 'classify' my news items
>> into 'company names'. As I read on, I got the impression that the name
>> classification is a bit deceiving and that the technique is used more for
>> prediction than for classification. The other confusion I had was: how
>> can I prepare training data for news documents? Let's assume I have a list
>> of companies that I am interested in. I write a program to produce
>> training data for the classifier. The program checks whether the news
>> title or description contains the company name 'Apple'; if so, it's a news
>> story about Apple. Is this how I can prepare training data? (Of course, I
>> read that training data is actually a set of predictors and target
>> variables.) If so, then why should I use Mahout classification in the
>> first place? I should ditch Mahout and instead use this little program
>> that I wrote for training data (which actually does the classification).
>>
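The "little program" described in the paragraph above could look roughly like the following sketch. It is not the poster's actual code: the company list, the function name `label_news`, and the sample news items are assumptions made for illustration. It labels an item with every listed company whose name appears as a whole word in the title or description.

```python
import re

# Hypothetical list of companies of interest; not taken from the original post.
COMPANIES = ["Apple", "Google", "Intel"]

def label_news(title, description, companies=COMPANIES):
    """Return every company whose name appears as a whole word in the text."""
    text = f"{title} {description}"
    return [
        name for name in companies
        if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE)
    ]

news = [
    ("Apple launches new iphone", ""),
    ("iphone sales at a record high", ""),
    ("Chip deal", "Google and Intel sign agreement"),
]

# Keep only the items the rule could label; these become (text, labels) pairs.
training_data = []
for title, description in news:
    labels = label_news(title, description)
    if labels:
        training_data.append((f"{title} {description}".strip(), labels))

print(training_data)
```

The second sample item gets no label because it never mentions a company name. A classifier trained on the labeled items could still learn that terms such as 'iphone' co-occur with the Apple label, which a plain name match cannot do.
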
>> You can see how confused I am about how to address this issue. Another
>> thing that concerns me is whether it's possible to make a system so
>> intelligent that, if the news says 'iPhone sales at a record high' without
>> using the word 'Apple', the system can classify it as news related to
>> Apple.
>>
>> Thank you in advance for pointing me in the right direction.
>>
>
