Try using EncodedVectorsFromSequenceFiles
On Jul 8, 2012 2:04 AM, "Alexander Aristov" <[email protected]>
wrote:

> So what numbers should I aim for? 100, or 1,000 training files per category?
>
> When you wrote L1 regularized logistic regression, did you mean the SGD
> algorithm? Can I take it from an example?
>
> thanks
>
> Best Regards
> Alexander Aristov
>
>
> On 8 July 2012 02:20, Ted Dunning <[email protected]> wrote:
>
> > This is a really tiny training set.  NB works much better with larger
> > data sets.  This pattern of performing much better on training data
> > than on test data indicates that the small data set is giving you
> > problems.  This could be over-fitting, but it is likely also
> > exacerbated by the number of unknown words being encountered.
> >
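[A quick illustration of the train-vs-holdout gap discussed in this thread, as a hand-rolled toy multinomial Naive Bayes in Python. This is a sketch on invented synthetic data, not Mahout code: it only shows how, with a tiny training set, unseen words contribute equally to every class at test time, so held-out accuracy collapses while self-test accuracy stays near-perfect.]

```python
# Toy multinomial Naive Bayes (hand-rolled sketch, NOT Mahout code) showing
# how a tiny training set gives perfect self-test accuracy but much weaker
# held-out accuracy once unknown words appear. All data below is synthetic.
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (label, tokens). Returns class counts, per-class word counts, vocab."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, class_counts, word_counts, vocab):
    total = sum(class_counts.values())
    best, best_lp = None, float("-inf")
    for label, count in class_counts.items():
        lp = math.log(count / total)                      # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for t in tokens:                                  # Laplace-smoothed likelihoods
            lp += math.log((word_counts[label][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = label, lp
    return best

train = [("pos", ["good", "great"]), ("pos", ["good", "fine"]),
         ("neg", ["bad", "awful"]), ("neg", ["bad", "poor"])]
test = [("pos", ["good", "novel"]),   # one known word, one unseen word
        ("pos", ["bad", "novel"])]    # a single contrary known word dominates

model = train_nb(train)
train_acc = sum(classify(t, *model) == y for y, t in train) / len(train)
test_acc = sum(classify(t, *model) == y for y, t in test) / len(test)
print(train_acc, test_acc)  # 1.0 0.5
```

On the training docs every word has been seen, so accuracy is perfect; on the held-out docs the unknown word cancels out across classes, leaving a single known word to decide the prediction.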
> > My own tendency would be to use L1 regularized logistic regression on
> > this.  In R, glmnet is an excellent choice in that it gives you the
> > chance to use cross validation to determine expected performance.
> >
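[The suggestion above, L1-regularized logistic regression with cross-validation to estimate held-out performance, is what glmnet does in R via coordinate descent. As a rough, self-contained Python sketch of the same idea, using plain SGD with an L1 subgradient rather than glmnet's solver, and purely synthetic data:]

```python
# Sketch of L1-regularized logistic regression with k-fold cross-validation.
# This is an illustrative hand-rolled implementation, not glmnet or Mahout.
import math
import random

def sgd_l1_logreg(X, y, lam, epochs=50, lr=0.1):
    """Binary logistic regression trained by SGD with an L1 subgradient penalty."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(xi):
                # gradient of log-loss plus L1 subgradient
                grad = (p - yi) * xj + lam * (1 if w[j] > 0 else -1 if w[j] < 0 else 0)
                w[j] -= lr * grad
    return w

def cv_accuracy(X, y, lam, k=5, seed=0):
    """k-fold cross-validated accuracy for one regularization strength."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        train_idx = [i for i in idx if i not in fold]
        w = sgd_l1_logreg([X[i] for i in train_idx], [y[i] for i in train_idx], lam)
        for i in fold:
            z = sum(wj * xj for wj, xj in zip(w, X[i]))
            correct += int((z > 0) == (y[i] == 1))
    return correct / len(X)

# Synthetic data: one informative feature, one pure-noise feature, one bias term.
rng = random.Random(42)
X, y = [], []
for _ in range(40):
    label = rng.randint(0, 1)
    informative = (1.0 if label else -1.0) + rng.gauss(0, 0.3)
    noise = rng.gauss(0, 1.0)
    X.append([informative, noise, 1.0])
    y.append(label)

for lam in (0.0, 0.01, 0.1):
    print(lam, round(cv_accuracy(X, y, lam), 2))
```

Mahout's SGD package does include logistic regression trained in this style (e.g. OnlineLogisticRegression), which is presumably what the SGD question in this thread refers to; the sketch above only shows the shape of the computation, not Mahout's API.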
> > On Sat, Jul 7, 2012 at 1:48 PM, Alexander Aristov
> > <[email protected]> wrote:
> >
> > > People,
> > >
> > > I am implementing Naive Bayes classifier on my text data and get poor
> > > results.
> > >
> > > Self-testing on the training data gives 95% positive and 5% negative
> > > results (not bad). But testing on a held-out set gives 60-40%, which
> > > is not good enough for me.
> > >
> > > I tried playing with the vectorizer arguments, but setting them
> > > randomly only makes the results worse. I have 7 categories and about
> > > 20-90 docs per category.
> > >
> > > What can you suggest to improve the results? I tried the complementary
> > > NB algorithm, but it gives approximately the same results.
> > >
> > > I use Mahout trunk, version 0.8.
> > >
> > > Best Regards
> > > Alexander Aristov
> > >
> >
>
