Mahout's SGD will work on small numbers of examples if you go through the training data in randomized order many times. Even your small amount of data will suffice. I would recommend using OnlineLogisticRegression rather than AdaptiveLogisticRegression because of the multi-pass nature of your training. The normal way to use SGD is with L1 regularization.
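To make the idea concrete, here is a minimal, self-contained Python sketch of that recipe: multi-pass SGD for logistic regression with a decaying learning rate and per-step L1 shrinkage, shuffling the examples on every pass. This is not Mahout's OnlineLogisticRegression code; all names and the specific shrinkage scheme here are illustrative, but the three knobs (`eta0`, `decay`, `lam`) mirror the meta-parameters discussed below.

```python
import math
import random

def train_sgd_l1(data, passes=50, eta0=0.5, decay=0.001, lam=0.01, seed=42):
    """Multi-pass SGD for binary logistic regression with per-step L1 shrinkage.

    data: list of (features, label) pairs with label in {0, 1}.
    eta0/decay define the learning-rate schedule; lam is the L1 strength.
    """
    rng = random.Random(seed)
    examples = list(data)
    n_features = len(examples[0][0])
    w = [0.0] * n_features
    step = 0
    for _ in range(passes):
        rng.shuffle(examples)  # randomized order on every pass
        for x, y in examples:
            eta = eta0 / (1.0 + decay * step)  # simple decay schedule
            step += 1
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of class 1
            for i, xi in enumerate(x):
                w[i] += eta * (y - p) * xi  # gradient step on the log-likelihood
                # soft-threshold toward zero: this is the L1 regularization
                w[i] = math.copysign(max(abs(w[i]) - eta * lam, 0.0), w[i])
    return w

def predict(w, x):
    """Probability of class 1 under weights w."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

On a tiny data set the many shuffled passes are what let SGD converge; a single pass over a few hundred documents is not enough.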
When I wrote that, though, I was really suggesting that you experiment with R first. The glmnet package provides very nice cross-validation capabilities that allow you to find good meta-learning parameters. In glmnet there is only lambda, the degree of regularization, but with the Mahout SGD you would also need to determine the initial learning rate and the learning rate decay schedule as well as lambda. Since Mahout doesn't provide a framework for cross validation, you would have to code that up as well (it is pretty easy).

On Sun, Jul 8, 2012 at 7:41 AM, Robin Anil <[email protected]> wrote:

> Try using encodedvectorsfromsequencefile
>
> On Jul 8, 2012 2:04 AM, "Alexander Aristov" <[email protected]> wrote:
>
> > So what numbers shall I think about? 100, 1000 training files per category?
> >
> > When you were writing "L1 regularized logistic regression" did you mean the SGD algorithm? Can I take it from an example?
> >
> > thanks
> >
> > Best Regards
> > Alexander Aristov
> >
> > On 8 July 2012 02:20, Ted Dunning <[email protected]> wrote:
> >
> > > This is a really tiny training set. NB works much better with larger data sets. This pattern of performing much better on training data than on test data indicates that the small data set is giving you problems. This could be over-fitting, but it is likely also exacerbated by the number of unknown words being encountered.
> > >
> > > My own tendency would be to use L1 regularized logistic regression on this. In R, glmnet is an excellent choice in that it gives you the chance to use cross validation to determine expected performance.
> > >
> > > On Sat, Jul 7, 2012 at 1:48 PM, Alexander Aristov <[email protected]> wrote:
> > >
> > > > People,
> > > >
> > > > I am implementing a Naive Bayes classifier on my text data and getting poor results.
> > > >
> > > > Self-testing on trained data gives 95% pos and 5% neg results (not bad).
> > > > But testing on the hold-out set gives 60-40%, which is not good for me.
> > > >
> > > > I tried to play with the vectorizer arguments, but setting them randomly only makes results worse. I have 7 categories and about 20-90 docs per category.
> > > >
> > > > What can you suggest I do to improve results? I tried the complementary NB algorithm, but it gives approximately the same results.
> > > >
> > > > I use the Mahout trunk version 0.8.
> > > >
> > > > Best Regards
> > > > Alexander Aristov
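The cross-validation harness described above really is easy to code up. As a hedged sketch (not Mahout API; `train_fn` and `accuracy_fn` are hypothetical plug-in points for whatever trainer you use), a generic k-fold loop that picks the lambda with the best mean held-out accuracy might look like:

```python
import random

def k_fold_cv(data, lambdas, train_fn, accuracy_fn, k=5, seed=0):
    """Return (best_lambda, best_mean_accuracy) over k folds.

    train_fn(train_examples, lam) -> model
    accuracy_fn(model, held_out_examples) -> float in [0, 1]
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # round-robin split into k folds
    best_lam, best_acc = None, -1.0
    for lam in lambdas:
        accs = []
        for i in range(k):
            held_out = folds[i]
            train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
            model = train_fn(train, lam)
            accs.append(accuracy_fn(model, held_out))
        mean_acc = sum(accs) / k
        if mean_acc > best_acc:  # ties keep the first lambda tried
            best_lam, best_acc = lam, mean_acc
    return best_lam, best_acc
```

The same loop extends naturally to a grid over learning rate and decay schedule as well as lambda, which is exactly the extra tuning Mahout's SGD needs compared to glmnet.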
