Hi Zach and Ted,
Thanks a lot for your answers :)
So I will try to focus on SVM instead of SGD/Naive Bayes.
I'll also take a look at Rapid Miner and Luduan.
Mahout in Action says that SVM has been added to Mahout as "an
experimental implementation".
Do you think it's usable in production anyway?
Thanks
Loic
On 09.09.2011 19:54, Zach Richardson wrote:
Hi Loic,
In my experience, when dealing with smaller datasets (i.e. fewer than,
let's say, 1000 training examples, or even fewer than 100 per category),
a linear SVM tends to perform better than Mahout's SGD.
I would recommend either using Rapid Miner, if you want a pretty GUI and
some configurable text import tools, or liblinear/libsvm from the command
line.
The former will let you iterate quickly on what you are trying to do
without any custom coding. However, depending on how you want to deploy
this, you might need to stick with liblinear/libsvm (RapidMiner uses the
libsvm library internally) for the true "deployable" system, since the
RapidMiner libraries are all AGPL.
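For instance, here's a rough sketch of what the liblinear route can look
like from Java, using the liblinear-java port (I'm assuming the
de.bwaldvogel.liblinear package here; exact field types have shifted
between versions, and the feature values below are made up):

  import de.bwaldvogel.liblinear.*;

  public class TinyLinearSvm {
    public static void main(String[] args) {
      Problem problem = new Problem();
      problem.l = 2;                  // number of training documents
      problem.n = 3;                  // number of features
      problem.x = new Feature[][] {   // sparse vectors, 1-based indices
        { new FeatureNode(1, 1.0), new FeatureNode(3, 2.0) },
        { new FeatureNode(2, 0.5) }
      };
      problem.y = new double[] { 1, 0 };  // category labels
                                          // (int[] in older versions)

      // L2-regularized L2-loss linear SVM; C=1.0 and eps=0.01 are
      // just starting points to tune
      Parameter param = new Parameter(SolverType.L2R_L2LOSS_SVC, 1.0, 0.01);
      Model model = Linear.train(problem, param);

      double predicted = Linear.predict(
          model, new Feature[] { new FeatureNode(1, 0.8) });
      System.out.println("predicted category: " + predicted);
    }
  }

The command-line train/predict tools follow the same model with a
"label index:value index:value ..." input format.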
You can find examples for either online. If you are still having problems,
I would be more than happy to share a rapidminer pipeline for processing
documents, training a classifier, etc.
Zach
On Fri, Sep 9, 2011 at 12:16 PM, Ted Dunning <[email protected]> wrote:
On Fri, Sep 9, 2011 at 8:41 AM, Loic Descotte <[email protected]> wrote:
... My goal is to make predictions on thousands of text entries, but with
as little training data as possible (categories may often change, so I
will not always have hundreds of training entries for each category).
This is very small with respect to Mahout's algorithms. There may be better
options. The standard choice for small text datasets like this is linear
SVM, but SGD should work reasonably well. Naive Bayes may not work as well
with such a small amount of training data. I would avoid the adaptive SGD
and tune the training parameters by hand.
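For illustration, a hand-tuned, non-adaptive setup might look roughly
like the examples in Mahout in Action; the parameter values below are
placeholders to show where the tuning happens, not recommendations:

  import org.apache.mahout.classifier.sgd.L1;
  import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;

  public class HandTunedSgd {
    public static void main(String[] args) {
      int numCategories = 2;
      int numFeatures = 10000;    // size of the hashed feature space

      OnlineLogisticRegression learner =
          new OnlineLogisticRegression(numCategories, numFeatures, new L1())
              .alpha(1)             // learning-rate decay
              .stepOffset(1000)
              .decayExponent(0.9)
              .lambda(3.0e-5)       // regularization strength
              .learningRate(20);    // all tuned by hand, e.g. via
                                    // cross-validation

      Vector doc = new RandomAccessSparseVector(numFeatures);
      // ... fill doc with encoded features for one training example ...
      learner.train(0, doc);        // 0 = correct category of this example

      Vector scores = learner.classifyFull(doc);  // one score per category
      System.out.println("scores: " + scores);
    }
  }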
Another question: in all the examples I've found, Naive Bayes is used to
analyze sets containing a lot of keywords and to classify them into the
right category (e.g. the Wikipedia examples:
https://www.ibm.com/developerworks/java/library/j-mahout/#N10412).
SGD examples are a little different: instead of working on word sequences,
they use many predictor values, and each predictor has only one value for
each entry.
That is true in Chapter 13, where SGD is introduced. Later chapters
illustrate its use on the 20 newsgroups data.
Is it possible to use the SGD algorithm (maybe better for me because I
have small datasets) with text-only entries (like blog posts)?
Yes. This should work fine.
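For example, a minimal sketch of turning a post into a vector with
Mahout's hashed word encoders (the field name and vector size are
arbitrary, and the encoder package has moved between Mahout versions):

  import org.apache.mahout.math.RandomAccessSparseVector;
  import org.apache.mahout.math.Vector;
  import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

  public class EncodePost {
    public static void main(String[] args) {
      StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
      Vector v = new RandomAccessSparseVector(10000);  // hashed space
      for (String word : "my small blog post about mahout".split(" ")) {
        encoder.addToVector(word, v);  // hash each token into the vector
      }
      // v can now be fed to OnlineLogisticRegression.train()
      // and classifyFull()
    }
  }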
I would also consider the Luduan algorithm, which is not currently part of
Mahout, although all the pieces are there.
The basic idea is that for each binary decision you have three kinds of
documents: unjudged documents, judged relevant documents, and judged
non-relevant documents. Luduan uses a log-likelihood ratio test to compare
the judged relevant and judged non-relevant sets. This comparison gives a
set of search terms that are used with standard retrieval weighting such
as tf-idf or BM25. Term weights are determined by corpus frequencies,
without any explicit reference to the frequencies in the judged relevant
or non-relevant documents.
For some classification tasks with modest-sized training data, this method
outperforms most others.
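To make the term-selection step concrete, here is a tiny sketch using the
LLR implementation that ships with Mahout (the counts are invented):

  import org.apache.mahout.math.stats.LogLikelihood;

  public class TermScore {
    public static void main(String[] args) {
      // Hypothetical contingency counts for one candidate term:
      long k11 = 30;   // occurrences of the term in judged-relevant docs
      long k12 = 500;  // other term occurrences in judged-relevant docs
      long k21 = 2;    // occurrences of the term in non-relevant docs
      long k22 = 800;  // other term occurrences in non-relevant docs

      double score = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
      System.out.println("LLR score: " + score);
      // Keep the highest-scoring terms as the query; at retrieval time,
      // weight them with ordinary corpus statistics (tf-idf or BM25),
      // not with these judged-set counts.
    }
  }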
I can send a PDF with a more detailed description.
Thanks a lot for your time, and tell me if I'm not clear enough in my
explanations :)
Please tell me the same.