The Wikipedia page recommends binning if you have a large amount of data, and a supervised variable-extraction method if not. Both are ways of preprocessing continuous variables into discrete ones.
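
For the per-variable case, something as simple as equal-width binning is usually enough before handing the data to the Bayes trainer. Below is a minimal plain-Java sketch (the bin count and the "height_bin3"-style feature naming are arbitrary choices of mine, not anything Mahout prescribes); the resulting tokens can then be treated like any other word-like feature:

// Equal-width binning of one continuous feature into a fixed number of
// discrete tokens. Bin boundaries are learned from the training values.
public class EqualWidthBinner {

  private final String featureName;
  private final double min;
  private final double width;
  private final int bins;

  public EqualWidthBinner(String featureName, double[] trainingValues, int bins) {
    double lo = Double.POSITIVE_INFINITY;
    double hi = Double.NEGATIVE_INFINITY;
    for (double v : trainingValues) {
      lo = Math.min(lo, v);
      hi = Math.max(hi, v);
    }
    this.featureName = featureName;
    this.min = lo;
    this.width = (hi - lo) / bins;
    this.bins = bins;
  }

  /** Maps a continuous value to a discrete token, clamping out-of-range values. */
  public String binToken(double value) {
    int bin = width == 0 ? 0 : (int) ((value - min) / width);
    bin = Math.max(0, Math.min(bins - 1, bin));
    return featureName + "_bin" + bin;
  }

  public static void main(String[] args) {
    double[] heights = {5.92, 5.58, 5.42, 6.0, 5.0, 5.5, 5.75, 5.92};
    EqualWidthBinner binner = new EqualWidthBinner("height", heights, 4);
    for (double h : heights) {
      System.out.println(h + " -> " + binner.binToken(h));
    }
  }
}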
On Mon, Jul 4, 2011 at 11:28 AM, Ted Dunning <[email protected]> wrote:
> The Mahout implementation of Naive Bayes does not use continuous variables
> well. The best bet is to discretize these variables, either individually or
> together, using k-means, then use the discrete version for the classifier.
>
> The random forest implementation and the SGD implementation are both
> happier with continuous variables.
>
>
> On Mon, Jul 4, 2011 at 8:01 AM, Vijay Santhanam <[email protected]> wrote:
>
>> Hi,
>>
>> I'm new to Mahout and many of the machine learning ideas, but from what I
>> understand of the Naive Bayes classifier, it's possible to train a model
>> with continuous, categorical, and word-like features, based on the
>> Wikipedia entry:
>> http://en.wikipedia.org/wiki/Naive_Bayes_classifier
>>
>> From what I gather, the 20news and wikipedia examples currently in Mahout
>> only use a categorical target variable and text-like features.
>>
>> I'm trying to replicate the person-gender guesser from the Wikipedia
>> article using Mahout.
>>
>> Can anyone give me tips on how to:
>> * format input files (train and test) for different data types
>> * tell the trainer and classifier which features are continuous,
>> categorical, and word-like
>>
>> My dataset is quite small, so I'd like to be able to process it in code
>> (using Vectors, Models, etc.), but I'm quite confused about how to use the
>> classifier.bayes packages to train/create a model with all my feature types.
>>
>> Thanks in advance for any guidance.
>>
>> Cheers,
>> --
>> Vijay Santhanam
>> Software Engineer
>> http://au.linkedin.com/in/vijaysanthanam
>> 0407525087
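
P.S. To make the "discretize together using k-means" suggestion above concrete, here is a small self-contained sketch using plain Lloyd's iterations rather than the Mahout k-means driver. It clusters the continuous columns jointly and uses the resulting cluster id as a single categorical token; the class name, feature naming, and toy rows are mine, loosely following the Wikipedia gender example:

import java.util.Random;

// Sketch: jointly discretize several continuous columns by running a tiny
// in-memory k-means and replacing each row's continuous values with the id
// of its nearest cluster. The cluster id then acts as one categorical feature.
public class KMeansDiscretizer {

  private final double[][] centroids;

  public KMeansDiscretizer(double[][] rows, int k, int iterations, long seed) {
    Random rnd = new Random(seed);
    int dims = rows[0].length;
    centroids = new double[k][dims];
    // Seed centroids from randomly chosen training rows.
    for (int c = 0; c < k; c++) {
      centroids[c] = rows[rnd.nextInt(rows.length)].clone();
    }
    for (int iter = 0; iter < iterations; iter++) {
      double[][] sums = new double[k][dims];
      int[] counts = new int[k];
      for (double[] row : rows) {
        int c = nearest(row);
        counts[c]++;
        for (int d = 0; d < dims; d++) {
          sums[c][d] += row[d];
        }
      }
      for (int c = 0; c < k; c++) {
        if (counts[c] > 0) {
          for (int d = 0; d < dims; d++) {
            centroids[c][d] = sums[c][d] / counts[c];
          }
        }
      }
    }
  }

  /** Returns the id of the nearest centroid; use it as a categorical token. */
  public int nearest(double[] row) {
    int best = 0;
    double bestDist = Double.POSITIVE_INFINITY;
    for (int c = 0; c < centroids.length; c++) {
      double dist = 0;
      for (int d = 0; d < row.length; d++) {
        double diff = row[d] - centroids[c][d];
        dist += diff * diff;
      }
      if (dist < bestDist) {
        bestDist = dist;
        best = c;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Toy rows of (height, weight, foot size), loosely after the Wikipedia example.
    double[][] rows = {
        {6.0, 180, 12}, {5.92, 190, 11}, {5.58, 170, 12}, {5.92, 165, 10},
        {5.0, 100, 6}, {5.5, 150, 8}, {5.42, 130, 7}, {5.75, 150, 9}
    };
    KMeansDiscretizer disc = new KMeansDiscretizer(rows, 2, 10, 42L);
    for (double[] row : rows) {
      System.out.println("body_cluster" + disc.nearest(row));
    }
  }
}

With a dataset as small as the gender example, keep k small; otherwise each cluster ends up with too few rows for the class-conditional counts to be reliable.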
