Hi Grant,

Thank you for the clarification! Is there any reference or example of how to provide a Weight such as TF-IDF for certain words or phrases?
On Tue, Sep 28, 2010 at 4:35 PM, Grant Ingersoll <[email protected]> wrote:

> On Sep 27, 2010, at 1:53 PM, Neil Ghosh wrote:
>
>> Hi Grant,
>>
>> Thanks so much for responding. You can reply to this on the mailing list. I have changed my problem to a slightly more common one.
>>
>> I have already gone through the tutorial you wrote on the IBM site. It was very good to start with; thanks anyway.
>>
>> To be specific, my problem is to classify a piece of text crawled from the web into two classes:
>>
>> 1. It is +ve feedback.
>> 2. It is -ve feedback.
>>
>> I can use the Twenty Newsgroups example and create a model by feeding the trainer some text (maybe a large number of texts) under these two labels. Should I leave everything to the trainer completely, like this?
>
> Yes, that should be fine. The trainer doesn't care about the name of the label; it just cares that the two sets are relatively independent. Keep in mind, you should set aside some of your data for testing as well.
>
>> Or do I have the flexibility to give some other input specific to my problem? For example, words like "problem" and "complaint" are more likely to appear in a text containing a grievance.
>
> You can provide a Weight, usually TF-IDF, that often does a good job of factoring in the importance of words. If you have certain sentiment words that you think influence things one way or the other, you could consider a weighting process that adds weight to those words, I suppose, but I would want to experiment with that a bit.
>
>> Please let me know if you have any ideas or need more info from my side.
>>
>> Thanks
>> Neil
>>
>> On Mon, Sep 27, 2010 at 6:12 PM, Grant Ingersoll <[email protected]> wrote:
>>
>>> On Sep 24, 2010, at 1:12 PM, Neil Ghosh wrote:
>>>
>>>> Are there any other examples, documents, or references on how to use Mahout for text classification?
>>>>
>>>> I went through and ran the following:
>>>> 1. Wikipedia Bayes Example <https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html> - classify Wikipedia data.
>>>> 2. Twenty Newsgroups <https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html> - classify the classic Twenty Newsgroups data.
>>>>
>>>> However, these two are not very definitive, and there isn't much explanation for the examples. Please share if there is more documentation.
>>>
>>> What kinds of problems are you looking to solve? In general, we don't have too much in the way of special things for text, other than various utilities for converting text into Mahout's vector format based on various weighting schemes. Both of those examples just convert the text into vectors and then either train or test on them. I would agree, though, that a good tutorial is needed. It's a bit out of date in terms of the actual commands, but I believe the concepts are still accurate:
>>> http://www.ibm.com/developerworks/java/library/j-mahout/
>>>
>>> See https://cwiki.apache.org/confluence/display/MAHOUT/Mahout+Wiki#MahoutWiki-ImplementationBackground (and the creating-vectors section). Also see the Algorithms section.
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> http://lucenerevolution.org Apache Lucene/Solr Conference, Boston Oct 7-8
>>
>> --
>> Thanks and Regards
>> Neil
>> http://neilghosh.com
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search

--
Thanks and Regards
Neil
http://neilghosh.com
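[Editor's note] To make the TF-IDF suggestion in the thread above concrete: TF-IDF weighs a term by how often it occurs in a document, discounted by how common it is across the whole corpus, and a hand-picked sentiment lexicon ("problem", "complaint", ...) could be boosted on top of that, as Grant suggests experimenting with. The sketch below is NOT Mahout's Weight API; the class, method names, and boost factor are hypothetical, purely to illustrate the arithmetic.

```java
import java.util.Set;

public class SentimentTfIdf {
    // Classic TF-IDF: term frequency times inverse document frequency.
    //   tf      = occurrences of the term in this document
    //   df      = number of documents in the corpus containing the term
    //   numDocs = total number of documents in the corpus
    static double tfIdf(int tf, int df, int numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // Hypothetical extra weighting step: multiply the TF-IDF score by a
    // boost factor when the term belongs to a sentiment lexicon.
    static double weight(String term, int tf, int df, int numDocs,
                         Set<String> sentimentWords, double boost) {
        double w = tfIdf(tf, df, numDocs);
        return sentimentWords.contains(term) ? boost * w : w;
    }

    public static void main(String[] args) {
        Set<String> lexicon = Set.of("problem", "complaint");
        // A rare sentiment word gets a high, boosted weight...
        System.out.println(weight("problem", 3, 5, 100, lexicon, 2.0));
        // ...while a very common word is heavily discounted by IDF.
        System.out.println(weight("the", 10, 95, 100, lexicon, 2.0));
    }
}
```

Whether such a boost actually helps classification accuracy is exactly what Grant says should be tested empirically against a held-out test set.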
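[Editor's note] For the text-to-vector utilities Grant mentions, the rough shape of the command-line flow in Mahout of that era was a two-step conversion. Driver names and flags changed across releases, so treat this as a sketch rather than a definitive recipe, and run `bin/mahout` with no arguments to see the drivers your version actually ships; the paths here are placeholders.

```shell
# Step 1: convert a directory of raw text files into Hadoop SequenceFiles.
bin/mahout seqdirectory -i /path/to/raw/text -o /path/to/seqfiles

# Step 2: vectorize the SequenceFiles with TF-IDF weighting (-wt tfidf);
# the resulting vectors are what the Bayes trainer/tester consume.
bin/mahout seq2sparse -i /path/to/seqfiles -o /path/to/vectors -wt tfidf
```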
