Suneel, thanks a lot. I assume the example you mentioned generates a numerical vector for each paragraph, is that right?
Now, to further improve performance, I may add features from other data sets to this vector, making it much longer, and then use the enriched vector for naive Bayes, k-means, nearest neighbor, etc. How should the normalization be done in that case? Does this make sense?

Thanks,

On Thu, Jan 16, 2014 at 11:08 PM, Suneel Marthi <[email protected]> wrote:

> See
> http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
> for classifying twitter messages.
>
> Lucene has support for n-grams, stop words, the Porter stemmer, the
> Snowball stemmer, language-specific analyzers, etc.
> Mahout uses Lucene for vectorization (part of Mahout's seq2sparse
> process).
> See http://mahout.apache.org/users/basics/creating-vectors-from-text.html
>
>
> On Thursday, January 16, 2014 10:57 PM, qiaoresearcher <
> [email protected]> wrote:
>
> Mahout has an example of using naive Bayes to classify the 20 Newsgroups
> data set, but how do I classify paragraphs (e.g. Twitter messages, movie
> reviews) stored in text files with content like:
>
> ----------------------------------------------------------
> text paragraph 1    class a
> text paragraph 2    class b
> text paragraph 3    class a
> text paragraph 4    class b
> .............       ...
>
> Does it support n-grams, stemming, stop words, etc.?
>
> Thanks for any suggestions.
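As a follow-up on the normalization question: this is not Mahout-specific, but a minimal sketch of what I have in mind. The idea is that when tf-idf text features are concatenated with numeric features from another data set, the appended columns can have much larger raw ranges and would dominate the distance computations in k-means or nearest neighbor, so each column is rescaled to [0, 1] first, and optionally each row is scaled to unit length. All function names and the sample values below are hypothetical, just for illustration (and note that naive Bayes typically expects non-negative count-like features, so the right scaling may differ per algorithm):

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (useful for k-means / kNN)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def min_max_scale_columns(rows):
    """Rescale each feature column to [0, 1] so appended features with
    large raw ranges don't dominate distance computations."""
    cols = list(zip(*rows))
    scaled_cols = []
    for col in cols:
        lo, hi = min(col), max(col)
        span = hi - lo
        scaled_cols.append([(x - lo) / span if span else 0.0 for x in col])
    return [list(r) for r in zip(*scaled_cols)]

# Hypothetical enriched vectors: 3 tf-idf text features followed by
# 2 appended numeric features with much larger ranges.
rows = [
    [0.1, 0.0, 0.3, 52.0, 1200.0],
    [0.0, 0.2, 0.1, 31.0, 300.0],
    [0.4, 0.1, 0.0, 47.0, 950.0],
]
scaled = min_max_scale_columns(rows)          # per-column [0, 1] scaling
unit = [l2_normalize(r) for r in scaled]      # per-row unit length
```

Whether the per-row L2 step helps depends on the downstream algorithm; for cosine-distance k-means it effectively makes Euclidean distance behave like cosine distance, while for naive Bayes it is usually unnecessary.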
