Thank you, Ted. However, even with the default OnlineLogisticRegression I'm unable to get acceptable results when trying to replicate the gender-guesser discussed in the example at http://en.wikipedia.org/wiki/Naive_Bayes_classifier
For that particular problem, do you recommend I take a binning/discretization approach with naive Bayes, or should I continue trying to fine-tune the SGD algorithm? At this stage, I'm just hopelessly guessing parameters for OnlineLogisticRegression. Even when I iterate over the same data set many thousands of times, I'm unable to get a suitable model that can pick female or male from height, weight and shoe size.

Thanks again for taking the time to answer me.

-V

On Tue, Jul 5, 2011 at 4:30 AM, Ted Dunning <[email protected]> wrote:

> The wikipedia page recommends binning if you have a large amount of data and
> a supervised variable extraction method if not. These are both ways of
> preprocessing to discretize continuous variables.
>
> On Mon, Jul 4, 2011 at 11:28 AM, Ted Dunning <[email protected]> wrote:
>
> > The mahout implementation of Naive_Bayes does not use continuous variables
> > well. The best bet is to discretize these variables either individually or
> > together using k-means. Then use the discrete version for the classifier.
> >
> > The random forest implementation and the SGD implementation are both
> > happier with continuous variables.
> >
> > On Mon, Jul 4, 2011 at 8:01 AM, Vijay Santhanam <[email protected]> wrote:
> >
> >> Hi,
> >>
> >> I'm new to Mahout and many of the machine learning ideas, but from what I
> >> understand of Naive Bayes classifier, it's possible to train a Naive Bayes
> >> model with continuous, categorical and word-like features from my
> >> understanding of the wikipedia entry
> >> http://en.wikipedia.org/wiki/Naive_Bayes_classifier
> >>
> >> The 20news and wikipedia examples currently in mahout from what I gather
> >> only use a target categorical variable and text-like variables.
> >>
> >> I'm trying to replicate the person-gender-guesser used in the wikipedia
> >> article using mahout.
> >>
> >> Can anyone give me any tips about how to:
> >> * format input files (train and test) for different data types
> >> * inform the trainer and classifier which features are continuous,
> >>   categorical and word-like
> >>
> >> My dataset is quite small, so I'd like to be able to process this in code
> >> (using Vectors, Models, etc), but I'm quite confused about how to use the
> >> classifier.bayes packages to train/create model with all my feature types.
> >>
> >> Thanks in advance for any guidance.
> >>
> >> Cheers,
> >> --
> >> Vijay Santhanam
> >> Software Engineer
> >> http://au.linkedin.com/in/vijaysanthanam
> >> 0407525087

--
Vijay Santhanam
Software Engineer
http://au.linkedin.com/in/vijaysanthanam
0407525087
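
For reference, here is a minimal sketch of the Gaussian naive Bayes calculation that the Wikipedia gender example walks through, written in plain Python rather than Mahout. It is only meant to give a baseline to compare a Mahout model against. The training rows are transcribed from the Wikipedia article and should be checked against it, and the names `fit`, `log_gaussian` and `predict` are illustrative, not part of any Mahout API.

```python
import math

# Training rows as (height in ft, weight in lbs, foot size in inches).
# Assumption: transcribed from the Wikipedia example; verify before relying on them.
TRAIN = {
    "male": [(6.00, 180, 12), (5.92, 190, 11), (5.58, 170, 12), (5.92, 165, 10)],
    "female": [(5.00, 100, 6), (5.50, 150, 8), (5.42, 130, 7), (5.75, 150, 9)],
}


def fit(train):
    """Return {label: (prior, [(mean, sample variance) per feature])}."""
    total = sum(len(rows) for rows in train.values())
    model = {}
    for label, rows in train.items():
        stats = []
        for column in zip(*rows):  # iterate feature by feature
            mean = sum(column) / len(column)
            # Unbiased sample variance (divide by n - 1).
            var = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
            stats.append((mean, var))
        model[label] = (len(rows) / total, stats)
    return model


def log_gaussian(x, mean, var):
    """Log of the normal density with the given mean and variance at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, sample):
    """Return the label with the highest log posterior score, plus all scores."""
    scores = {}
    for label, (prior, stats) in model.items():
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, mean, var) for x, (mean, var) in zip(sample, stats)
        )
    return max(scores, key=scores.get), scores


model = fit(TRAIN)
label, scores = predict(model, (6.0, 130, 8))
print(label, scores)
```

Working in log space avoids underflow when the per-feature densities are tiny. The same per-class means and variances could also be used to pick bin boundaries if the discretization route suggested above is taken instead.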
