Try using EncodedVectorsFromSequenceFiles.
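The point of the encoded vectors is feature hashing: every token is hashed
into a fixed-size vector, so words that never appeared in training still
land in known dimensions instead of becoming unknowns. A rough sketch with
the encoder classes from Mahout's vectorizer.encoders package (the "body"
field name and the 10,000 cardinality are arbitrary illustration values,
not anything the job requires):

import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class EncodeSketch {
  public static void main(String[] args) {
    // Hashed ("encoded") features: the vector size is fixed up front,
    // independent of the vocabulary, so unseen test words still hash
    // somewhere instead of being dropped.
    StaticWordValueEncoder encoder = new StaticWordValueEncoder("body");
    Vector v = new RandomAccessSparseVector(10000);

    for (String token : "some tokenized document text".split(" ")) {
      encoder.addToVector(token, v);
    }
    System.out.println(v);
  }
}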
On Jul 8, 2012 2:04 AM, "Alexander Aristov" <[email protected]> wrote:

> So what numbers shall I think about? 100, 1000 training files per
> category?
>
> When you wrote "L1 regularized logistic regression", did you mean the
> SGD algorithm? Can I take it from an example?
>
> thanks
>
> Best Regards
> Alexander Aristov
>
> On 8 July 2012 02:20, Ted Dunning <[email protected]> wrote:
>
> > This is a really tiny training set. NB works much better with larger
> > data sets. This pattern of performing much better on training data
> > than on test data indicates that the small data set is giving you
> > problems. This could be over-fitting, but it is likely also
> > exacerbated by the number of unknown words being encountered.
> >
> > My own tendency would be to use L1 regularized logistic regression
> > on this. In R, glmnet is an excellent choice in that it gives you
> > the chance to use cross validation to determine expected performance.
> >
> > On Sat, Jul 7, 2012 at 1:48 PM, Alexander Aristov <
> > [email protected]> wrote:
> >
> > > People,
> > >
> > > I am implementing a Naive Bayes classifier on my text data and get
> > > poor results.
> > >
> > > Self-testing on the training data gives 95% pos and 5% neg results
> > > (not bad). But testing on a hold-out set gives 60-40%, which is
> > > not good for me.
> > >
> > > I tried to play with the vectorizer arguments, but setting them
> > > randomly only makes the results worse. I have 7 categories and
> > > about 20-90 docs per category.
> > >
> > > What can you suggest I do to improve the results? I tried the
> > > complementary NB algorithm, but it gives approximately the same
> > > results.
> > >
> > > I use the Mahout trunk version, 0.8.
> > >
> > > Best Regards
> > > Alexander Aristov
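On the SGD question in the thread: Mahout's sgd package does provide
L1-regularized logistic regression (OnlineLogisticRegression trained with
an L1 prior), which matches what Ted describes, though the thread itself
doesn't confirm that is what he meant. A sketch under assumed settings
(the lambda, learning rate, cardinality, and toy documents are
placeholders showing the API shape, not tuned values):

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

public class SgdSketch {
  private static final int FEATURES = 10000;  // hashed feature cardinality
  private static final int CATEGORIES = 7;    // as in the original post

  // Hash one tokenized document into a fixed-size feature vector.
  static Vector encode(String[] tokens, StaticWordValueEncoder enc) {
    Vector v = new RandomAccessSparseVector(FEATURES);
    for (String t : tokens) {
      enc.addToVector(t, v);
    }
    return v;
  }

  public static void main(String[] args) {
    StaticWordValueEncoder enc = new StaticWordValueEncoder("body");

    // The L1 prior drives most weights to exactly zero, which is what
    // helps on a small, wide text problem (7 categories, ~20-90 docs each).
    OnlineLogisticRegression lr =
        new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1())
            .lambda(1e-4)       // regularization strength; worth sweeping
            .learningRate(50);

    // Toy data; real code would loop over the actual (label, document)
    // pairs, shuffled, for several epochs.
    String[][] docs = {{"good", "fast", "cheap"}, {"slow", "broken"}};
    int[] labels = {0, 1};
    for (int epoch = 0; epoch < 20; epoch++) {
      for (int i = 0; i < docs.length; i++) {
        lr.train(labels[i], encode(docs[i], enc));
      }
    }

    // classifyFull returns one probability per category.
    Vector p = lr.classifyFull(encode(new String[]{"fast", "cheap"}, enc));
    System.out.println("predicted category = " + p.maxValueIndex());
  }
}

For the cross-validation part of Ted's advice, AdaptiveLogisticRegression
in the same package wraps this learner in cross-validating folds with
automatic hyperparameter search; it is roughly the closest in-Mahout
analogue to what glmnet's cross validation gives you in R.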
