Hello,

This is my first mail to the Mahout mailing list :)

I'm working on a classification problem and I'm trying to figure out which algorithm would best suit my needs. I've read that SGD is better than Naive Bayes for small to medium data sets. Does "small" refer to the training data, to the data being classified, or to both? And does "better" mean faster, or does SGD also give more accurate results than Naive Bayes on data sets of this size?

My goal is to make predictions on thousands of text entries, but with as little training data as possible (categories may change often, so I will not always have hundreds of training entries for each category).

Another question: in all the examples I've found, Naive Bayes is used to analyze documents containing a lot of keywords and to classify them into the right category (e.g. the Wikipedia example: https://www.ibm.com/developerworks/java/library/j-mahout/#N10412 ).

The SGD examples are a little different: instead of working on word sequences, they use several predictor variables, and each predictor has only one value per entry.

E.g. (in Mahout in Action):

 $MAHOUT_HOME/bin/mahout trainlogistic --input donut.csv \
--output ./model \
--target color --categories 2 \
--predictors x y --types numeric \
--features 20 --passes 100 --rate 50

In this example, the x and y predictors each have only one value per entry.

My need is closer to the Naive Bayes Wikipedia example: I want to analyze a text and automatically find its category. So I have only one predictor variable (the words of the text), and this predictor variable is multivalued (several words).
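
To make that concrete, here is a minimal sketch (plain Python, not Mahout code) of what I imagine would be needed: turning the multivalued "words" predictor into a fixed-length numeric vector by hashing each word into one of a fixed number of buckets. The function name, the bucket count of 20 and the sample sentence are only my assumptions for illustration; I don't know whether Mahout does it this way.

```python
import zlib


def hash_text_features(text, num_features=20):
    """Map a text to a fixed-length vector of word counts (hashing trick).

    Hypothetical helper for illustration only; not part of Mahout.
    """
    vector = [0] * num_features
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash() on str
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


# One blog-post-like entry becomes one row of 20 numeric values,
# whatever the number of words it contains.
vec = hash_text_features("mahout makes machine learning on hadoop easy")
print(len(vec), sum(vec))  # prints: 20 7
```

Each entry would then have exactly one value per hashed feature, which looks like the shape the numeric predictors have in the donut example.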

Is it possible to use the SGD algorithm (which may suit me better because my data sets are small) with text-only entries such as blog posts?

Thanks a lot for your time. Tell me if my explanations are not clear enough :)

Loic
