Re: Text Classification using Mahout

Grant Ingersoll Tue, 28 Sep 2010 04:05:48 -0700

On Sep 27, 2010, at 1:53 PM, Neil Ghosh wrote:

> Hi Grant,
> 
> Thanks so much for responding. You can reply to this on the mailing list. I 
> have changed my problem to a slightly more common one.
> 
> I have already gone through the tutorial you wrote on the IBM site. It was 
> a very good starting point. Thanks.
> To be specific, my problem is to classify a piece of text crawled from the 
> web into two classes:
> 
> 1. It is positive (+ve) feedback.
> 2. It is negative (-ve) feedback.
> 
> I can use the two-newsgroup example and create a model from some text (maybe 
> a large number of texts) by giving the trainer these two labels. Should 
> I leave everything to the trainer completely like this?
>


Yes, that should be fine.  The trainer doesn't care what the labels are 
called; it only cares that the two sets are relatively independent.  Keep in 
mind that you should also set aside some of your data for testing.
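
As a rough illustration of setting aside test data, here is a minimal sketch in 
plain Python (not Mahout's own tooling). The function name split_train_test, 
the (label, text) tuple layout, and the 20% hold-out fraction are all 
assumptions made for the example. It splits each label separately so that both 
classes show up in the test set.

```python
import random
from collections import defaultdict


def split_train_test(labeled_docs, test_fraction=0.2, seed=42):
    """Hold out a fraction of each label's documents for testing.

    labeled_docs is a list of (label, text) tuples. This layout is an
    assumption for the sketch, not a Mahout input format.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")

    # Group documents by label so each class is split on its own.
    by_label = defaultdict(list)
    for label, text in labeled_docs:
        by_label[label].append((label, text))

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Keep at least one test document per label.
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

You would train only on the first returned list and report accuracy only on the 
second, so the evaluation never sees documents the model was trained on.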

> Or do I have the flexibility to give some other input specific to my problem, 
> such as the fact that words like "Problem" and "Complaint" are more likely to 
> appear in a text containing a grievance?

You can provide a weight, usually TF-IDF, which often does a good job of 
factoring in the importance of words.  If you have certain sentiment words that 
you think push the classification one way or the other, you could consider a 
weighting process that adds weight to those words, I suppose, but I would want 
to experiment with that a bit first.
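
To make the weighting idea concrete, here is a small sketch in plain Python. It 
is not Mahout's vectorizer: the function tfidf_vectors, the boost dictionary, 
and the log(N/df) + 1 form of IDF are assumptions chosen for illustration (IDF 
has several common variants). Each term weight is term frequency times IDF, 
optionally multiplied by a hand-picked factor for sentiment words.

```python
import math
from collections import Counter


def tfidf_vectors(docs, boost=None):
    """Compute TF-IDF weights for tokenized documents.

    docs  : list of token lists, one per document.
    boost : optional dict mapping a term to a multiplier (hypothetical
            knob for emphasizing hand-picked sentiment words).
    Returns one {term: weight} dict per document.
    """
    boost = boost or {}
    n_docs = len(docs)

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {}
        for term, count in tf.items():
            # Assumed IDF variant: log(N / df) + 1, so terms that occur
            # in every document still get a non-zero weight.
            idf = math.log(n_docs / df[term]) + 1.0
            vec[term] = count * idf * boost.get(term, 1.0)
        vectors.append(vec)
    return vectors


docs = [
    ["the", "service", "was", "a", "problem"],
    ["the", "service", "was", "great"],
]
plain = tfidf_vectors(docs)
boosted = tfidf_vectors(docs, boost={"problem": 2.0})
```

Whether a boost like this helps is an empirical question; comparing accuracy on 
held-out data with and without it is the experiment to run.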

> 
> Please let me know if you have any ideas or need more info from my side.
> 
> Thanks
> Neil
> 
> On Mon, Sep 27, 2010 at 6:12 PM, Grant Ingersoll <[email protected]> wrote:
> 
> On Sep 24, 2010, at 1:12 PM, Neil Ghosh wrote:
> 
> > Are there any other examples/documents/references on how to use Mahout for
> > text classification?
> >
> > I went through and ran the following:
> >
> >   1. Wikipedia Bayes Example
> >      <https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html> -
> >      Classify Wikipedia data.
> >   2. Twenty Newsgroups
> >      <https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html> -
> >      Classify the classic Twenty Newsgroups data.
> >
> > However, these two are not very definitive, and there isn't much
> > explanation of the examples. Please share if there is more documentation.
> 
> 
> What kinds of problems are you looking to solve?  In general, we don't have 
> too much in the way of special things for text other than we have various 
> utilities for converting text into Mahout's vector format based on various 
> weighting schemes.  Both of those examples just take and convert the text 
> into vectors and then either train or test on them.  I would agree, though, 
> that a good tutorial is needed.  This one is a bit out of date in terms of 
> the actual commands, but I believe the concepts are still accurate: 
> http://www.ibm.com/developerworks/java/library/j-mahout/
> 
> See 
> https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground
>  (and the creating vectors section).  Also see the Algorithms section.
> 
> 
> --------------------------
> Grant Ingersoll
> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
> 
> 
> 
> 
> -- 
> Thanks and Regards
> Neil
> http://neilghosh.com
> 
> 
> 

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search
