Hello,
First mail for me on Mahout ML :)
I'm working on a classification problem and I'm trying to figure out which
algorithm would best suit my needs.
I've read that SGD is better than Naive Bayes for small to medium data
sets. Does that mean the training data can be small, or that the data to
classify is small (or both)?
Also, does "better" mean faster, or does it also give more accurate
results than Naive Bayes on data sets of this size?
My goal is to make predictions on thousands of text entries, but with
as little training data as possible (categories may change often, so I
will not always have hundreds of training entries for each category).
Another question: in all the examples I've found, Naive Bayes is used to
analyze sets containing a lot of keywords and to classify them in the
right category (e.g. the Wikipedia examples:
https://www.ibm.com/developerworks/java/library/j-mahout/#N10412 ).
The SGD examples are a little different: instead of working on word
sequences, they use several predictor variables, and each predictor has
only one value for each entry.
E.g. (in Mahout in Action):
$MAHOUT_HOME/bin/mahout trainlogistic --input donut.csv \
--output ./model \
--target color --categories 2 \
--predictors x y --types numeric \
--features 20 --passes 100 --rate 50
In this example, the x and y predictors each have only one value per entry.
My need is closer to the Naive Bayes Wikipedia examples: I want to
analyze a text and automatically find its category. So I have only
one predictor variable (the words of the text) and this predictor
variable is multivalued (several words).
Is it possible to use the SGD algorithm (maybe better for me because I
have small data sets) with text-only entries (like blog posts)?
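To make the question more concrete, here is a minimal sketch of what I imagine, in plain Python and not using Mahout at all. Each text is hashed into a fixed-length vector of word counts, so the single multivalued text predictor becomes many numeric predictors, and a logistic regression is then trained with SGD. The names (hash_features, train_sgd, NUM_FEATURES), the parameter values and the toy data are all made up for illustration; I don't know whether Mahout's SGD handles text this way.

```python
import math
import zlib

NUM_FEATURES = 1024  # hypothetical vector size, chosen arbitrarily


def hash_features(text, num_features=NUM_FEATURES):
    """Map a text to a fixed-length vector of hashed word counts."""
    vec = [0.0] * num_features
    for word in text.lower().split():
        # crc32 is stable across runs, unlike Python's built-in hash()
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vec[index] += 1.0
    return vec


def predict(weights, bias, vec):
    """Logistic regression: estimated probability that the label is 1."""
    z = bias + sum(w * x for w, x in zip(weights, vec))
    return 1.0 / (1.0 + math.exp(-z))


def train_sgd(examples, passes=200, rate=0.1):
    """Train binary logistic regression, one SGD update per example."""
    weights = [0.0] * NUM_FEATURES
    bias = 0.0
    for _ in range(passes):
        for text, label in examples:
            vec = hash_features(text)
            error = label - predict(weights, bias, vec)
            bias += rate * error
            for i, x in enumerate(vec):
                if x:  # only the words present in this text move weights
                    weights[i] += rate * error * x
    return weights, bias


# Toy training set (made up): 1 = sports, 0 = politics
TRAIN = [
    ("football match goal team", 1),
    ("team wins the cup final", 1),
    ("election vote parliament minister", 0),
    ("minister announces new law", 0),
]

weights, bias = train_sgd(TRAIN)
# Score for an unseen text; values above 0.5 lean towards "sports"
print(predict(weights, bias, hash_features("the team scored a goal")))
```

If this is roughly the right idea, then my question is whether Mahout can do this encoding step for me, or whether I have to write it myself.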
Thanks a lot for your time. Tell me if I'm not clear enough in my
explanations :)
Loic