Suneel, thanks a lot.

I assume the example you mentioned generates a numerical vector for
each paragraph, is that right?

Now, to further improve performance, I may append features from other
data sets to this vector, making it much longer, and then use the
enriched vector for Naive Bayes, k-means, nearest neighbor, etc. How
should we do the normalization? Does this make sense?

Thanks,



On Thu, Jan 16, 2014 at 11:08 PM, Suneel Marthi <[email protected]> wrote:

> See
> http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/
> for classifying twitter messages.
>
> Lucene has support for ngrams, stopwords, porter stemmer, snowball
> stemmer, language specific analyzers etc...
> Mahout uses Lucene for vectorization (part of Mahout's seq2sparse
> process).
> See http://mahout.apache.org/users/basics/creating-vectors-from-text.html
>
>
> On Thursday, January 16, 2014 10:57 PM, qiaoresearcher <
> [email protected]> wrote:
>
> Mahout has an example of using Naive Bayes to classify the 20 Newsgroups
> data set, but how do I classify paragraphs (e.g. Twitter messages, movie
> reviews) in text files such as:
>
> The text files have content like:
> ----------------------------------------------------------
> text paragraph 1                     class a
> text paragraph 2                     class b
> text paragraph 3                     class a
> text paragraph 4                     class b
> .............                                      ...
>
> Does it support n-grams, stemming, stop words, etc.?
>
> Thanks for any suggestions.
>
