On Fri, Sep 9, 2011 at 8:41 AM, Loic Descotte <[email protected]> wrote:
> ... My goal is to make predictions on thousands of text entries, but with
> as little training data as possible (categories may change often, so I
> will not always have hundreds of training entries for each category).

This is very small with respect to Mahout algorithms. There may be better
options. The standard choice for small text datasets like this is a linear
SVM, but SGD should work reasonably well. Naive Bayes may not work as well
with such a small amount of training data. I would avoid the adaptive SGD
and tune the training parameters by hand.

> Another question: in all the examples I've found, Naive Bayes is used to
> analyze sets containing a lot of keywords and to classify them into the
> right category (e.g. the Wikipedia examples:
> https://www.ibm.com/developerworks/java/library/j-mahout/#N10412).
>
> The SGD examples are a little different: instead of working on word
> sequences, they use many predictor values, and each predictor has only
> one value per entry.

That is true in Chapter 13, where SGD is introduced. Later chapters
illustrate its use on the 20 newsgroups data.

> Is it possible to use the SGD algorithm (maybe better for me because I
> have small datasets) with text-only entries (like blog posts)?

Yes. This should work fine. (There is a rough sketch at the end of this
message.)

I would also consider the Luduan algorithm, which is not currently part of
Mahout, although all the pieces are there. The basic idea is that for each
binary decision you have three kinds of documents: unjudged documents,
judged relevant documents, and judged non-relevant documents. Luduan uses
the log-likelihood ratio test to compare the judged relevant and judged
non-relevant sets. This comparison gives a set of search terms that are
used with standard retrieval weighting such as tf-idf or BM25. Term
weights are determined by corpus frequencies without any explicit
reference to the frequencies in the judged relevant or non-relevant
documents. For some classification tasks with modest-sized training data,
this method outperforms most others. I can send a PDF with a more detailed
description. (A sketch of the term-selection step is also at the end of
this message.)

> Thanks a lot for your time; tell me if I'm not clear enough in my
> explanations :)

Please tell me the same.
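
To make the SGD suggestion concrete, here is a rough sketch of training on
raw text with Mahout's SGD logistic regression. Package names have moved
between releases (this follows the 0.5-era layout), and the class name,
feature count, learning rate, lambda, and sample posts are all placeholder
choices, not recommendations:

// Rough sketch only: a hashed bag-of-words encoding fed to Mahout's
// OnlineLogisticRegression. Constants and sample data are made up.
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class TextSgdSketch {
  private static final int FEATURES = 10000;  // hashed feature space (placeholder)
  private static final int CATEGORIES = 5;    // hypothetical category count

  // Encode a post as a sparse vector of hashed word features.
  static Vector encode(String post, StaticWordValueEncoder encoder) {
    Vector v = new RandomAccessSparseVector(FEATURES);
    for (String word : post.toLowerCase().split("\\W+")) {
      encoder.addToVector(word, v);
    }
    return v;
  }

  public static void main(String[] args) {
    // Plain (non-adaptive) SGD with an L1 prior; tune rate and lambda by hand.
    OnlineLogisticRegression model =
        new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1())
            .learningRate(20)
            .lambda(1e-4);
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("words");

    int[] labels = {0, 1};                    // made-up training labels
    String[] posts = {"scala implicit conversions explained",
                      "tuning the mahout sgd classifier"};

    for (int pass = 0; pass < 20; pass++) {   // several passes help on tiny data
      for (int i = 0; i < posts.length; i++) {
        model.train(labels[i], encode(posts[i], encoder));
      }
    }

    // classifyFull returns one score per category.
    System.out.println(model.classifyFull(
        encode("implicit conversions in scala", encoder)));
  }
}

The L1 prior is a reasonable starting point here because it drives most
weights to zero, which is what you want when training examples are scarce.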

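To illustrate the Luduan term-selection step: Mahout's LogLikelihood
utility in org.apache.mahout.math.stats computes the 2x2 log-likelihood
ratio directly. The sketch below covers only the term-scoring part, with
made-up documents and a hypothetical class name; the retrieval side
(weighting the selected terms by tf-idf or BM25 from corpus frequencies)
is left out:

// Sketch of Luduan's term-selection step only: score terms by the 2x2
// log-likelihood ratio between judged relevant and judged non-relevant
// documents. Sample documents are made up; retrieval weighting is omitted.
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import org.apache.mahout.math.stats.LogLikelihood;

public class LuduanTermsSketch {
  // Document frequency: in how many docs of the set each term appears.
  static Map<String, Integer> docFreq(String[] docs) {
    Map<String, Integer> df = new HashMap<String, Integer>();
    for (String doc : docs) {
      for (String term : new HashSet<String>(
          Arrays.asList(doc.toLowerCase().split("\\W+")))) {
        Integer n = df.get(term);
        df.put(term, n == null ? 1 : n + 1);
      }
    }
    return df;
  }

  public static void main(String[] args) {
    String[] relevant = {"mahout sgd text classification",
                         "sgd logistic regression for text"};
    String[] nonRelevant = {"a recipe for pancakes",
                            "the weather forecast for monday"};

    Map<String, Integer> relDf = docFreq(relevant);
    Map<String, Integer> nonDf = docFreq(nonRelevant);

    // For each term seen in the relevant set:
    //   k11 = relevant docs with the term,  k12 = non-relevant docs with it,
    //   k21 = relevant docs without it,     k22 = non-relevant docs without it.
    for (Map.Entry<String, Integer> e : relDf.entrySet()) {
      long k11 = e.getValue();
      long k12 = nonDf.containsKey(e.getKey()) ? nonDf.get(e.getKey()) : 0;
      long k21 = relevant.length - k11;
      long k22 = nonRelevant.length - k12;
      double score = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
      System.out.println(e.getKey() + "\t" + score);
    }
    // The highest-scoring terms become the query; weight them with tf-idf
    // or BM25 computed from corpus frequencies, as described above.
  }
}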