Mahout's SGD will work on small numbers of examples if you go through the training data in randomized order many times. Even your small amount of data will suffice. I would recommend using OnlineLogisticRegression rather than AdaptiveLogisticRegression because of the multi-pass nature of your training. The normal way to use SGD is with L1 regularization.
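To make the idea concrete, here is a minimal, self-contained Python sketch of that recipe: multi-pass SGD for logistic regression with a decaying learning rate and per-step L1 shrinkage, shuffling the examples on every pass. This is not Mahout's OnlineLogisticRegression code; all names and the specific shrinkage scheme here are illustrative, but the three knobs (`eta0`, `decay`, `lam`) mirror the meta-parameters discussed below.

```python
import math
import random

def train_sgd_l1(data, passes=50, eta0=0.5, decay=0.001, lam=0.01, seed=42):
    """Multi-pass SGD for binary logistic regression with per-step L1 shrinkage.

    data: list of (features, label) pairs with label in {0, 1}.
    eta0/decay define the learning-rate schedule; lam is the L1 strength.
    """
    rng = random.Random(seed)
    examples = list(data)
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    step = 0
    for _ in range(passes):
        rng.shuffle(examples)  # randomized order on every pass
        for x, y in examples:
            eta = eta0 / (1.0 + decay * step)  # simple decay schedule
            step += 1
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for i, xi in enumerate(x):
                w[i] += eta * (y - p) * xi  # gradient step on the log-likelihood
                # soft-threshold toward zero: this is the L1 regularization
                w[i] = math.copysign(max(abs(w[i]) - eta * lam, 0.0), w[i])
    return w

def predict(w, x):
    """Probability of class 1 under weights w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

On a tiny data set the many shuffled passes are what let SGD converge; a single pass over a few hundred documents is not enough.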
When I wrote that, though, I was really suggesting that you experiment with R first. The glmnet package provides very nice cross-validation capabilities that allow you to find good meta-learning parameters. In glmnet there is only lambda, the degree of regularization, but with the Mahout SGD you would also need to determine the initial learning rate and the learning rate decay schedule as well as lambda. Since Mahout doesn't provide a framework for cross validation, you would have to code that up as well (it is pretty easy).

On Sun, Jul 8, 2012 at 7:41 AM, Robin Anil <[email protected]> wrote:

> Try using encodedvectorsfromsequencefile
>
> On Jul 8, 2012 2:04 AM, "Alexander Aristov" <[email protected]> wrote:
>
> > So what numbers shall I think about? 100, 1000 training files per category?
> >
> > When you were writing "L1 regularized logistic regression" did you mean the SGD algorithm? Can I take it from an example?
> >
> > thanks
> >
> > Best Regards
> > Alexander Aristov
> >
> > On 8 July 2012 02:20, Ted Dunning <[email protected]> wrote:
> >
> > > This is a really tiny training set. NB works much better with larger data sets. This pattern of performing much better on training data than on test data indicates that the small data set is giving you problems. This could be over-fitting, but it is likely also exacerbated by the number of unknown words being encountered.
> > >
> > > My own tendency would be to use L1 regularized logistic regression on this. In R, glmnet is an excellent choice in that it gives you the chance to use cross validation to determine expected performance.
> > >
> > > On Sat, Jul 7, 2012 at 1:48 PM, Alexander Aristov <[email protected]> wrote:
> > >
> > > > People,
> > > >
> > > > I am implementing a Naive Bayes classifier on my text data and getting poor results.
> > > >
> > > > Self-testing on trained data gives 95% pos and 5% neg results (not bad).
> > > > But testing on the hold-out set gives 60-40%, which is not good for me.
> > > >
> > > > I tried to play with the vectorizer arguments, but setting them randomly only makes results worse. I have 7 categories and about 20-90 docs per category.
> > > >
> > > > What can you suggest I do to improve results? I tried the complementary NB algorithm, but it gives approximately the same results.
> > > >
> > > > I use the Mahout trunk version 0.8.
> > > >
> > > > Best Regards
> > > > Alexander Aristov
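The cross-validation harness described above really is easy to code up. As a hedged sketch (not Mahout API; `train_fn` and `accuracy_fn` are hypothetical plug-in points for whatever trainer you use), a generic k-fold loop that picks the lambda with the best mean held-out accuracy might look like:

```python
import random

def k_fold_cv(data, lambdas, train_fn, accuracy_fn, k=5, seed=0):
    """Return (best_lambda, best_mean_accuracy) over k folds.

    train_fn(train_examples, lam) -> model
    accuracy_fn(model, held_out_examples) -> float in [0, 1]
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # round-robin split into k folds
    best_lam, best_acc = None, -1.0
    for lam in lambdas:
        accs = []
        for i in range(k):
            held_out = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            model = train_fn(train, lam)
            accs.append(accuracy_fn(model, held_out))
        mean_acc = sum(accs) / k
        if mean_acc > best_acc:  # ties keep the first lambda tried
            best_lam, best_acc = lam, mean_acc
    return best_lam, best_acc
```

The same loop extends naturally to a grid over learning rate and decay schedule as well as lambda, which is exactly the extra tuning Mahout's SGD needs compared to glmnet.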
