The Mahout implementation of Naive Bayes does not handle continuous variables well. Your best bet is to discretize these variables with k-means, either one variable at a time or jointly, and then feed the discrete version to the classifier.
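As a rough illustration of the discretization step, here is a minimal sketch in plain Java. It uses equal-width binning rather than k-means, simply because it is shorter to show; a 1-D k-means would slot into the same place. The class `EqualWidthDiscretizer`, its methods, and the sample height values are all hypothetical and not part of Mahout. The idea is only that each continuous value becomes a word-like token such as `height_bin2` that a text-oriented Naive Bayes trainer can treat like any other term.

```java
import java.util.Arrays;

/** Hypothetical helper (not a Mahout class): turns a continuous value into a bin token. */
final class EqualWidthDiscretizer {
    private final double min;    // smallest training value
    private final double width;  // width of one bin
    private final int bins;      // number of bins

    EqualWidthDiscretizer(double[] trainingValues, int bins) {
        if (trainingValues.length == 0 || bins < 1) {
            throw new IllegalArgumentException("need at least one value and one bin");
        }
        double lo = Arrays.stream(trainingValues).min().getAsDouble();
        double hi = Arrays.stream(trainingValues).max().getAsDouble();
        this.min = lo;
        this.bins = bins;
        this.width = (hi - lo) / bins;
    }

    /** Bin index in [0, bins - 1]; values outside the training range are clamped. */
    int binOf(double value) {
        if (width == 0.0) {
            return 0; // all training values were identical
        }
        int idx = (int) Math.floor((value - min) / width);
        return Math.max(0, Math.min(bins - 1, idx));
    }

    /** Word-like token for the discretized value, e.g. "height_bin2". */
    String token(String featureName, double value) {
        return featureName + "_bin" + binOf(value);
    }

    public static void main(String[] args) {
        // Made-up heights, for illustration only.
        double[] heights = {5.0, 5.5, 6.0};
        EqualWidthDiscretizer d = new EqualWidthDiscretizer(heights, 4);
        for (double h : heights) {
            System.out.println(h + " -> " + d.token("height", h));
        }
    }
}
```

Fit the bins on the training data only and reuse the same discretizer for the test data, so both sets share the same token vocabulary.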
The random forest implementation and the SGD implementation are both happier with continuous variables.

On Mon, Jul 4, 2011 at 8:01 AM, Vijay Santhanam <[email protected]> wrote:

> Hi,
>
> I'm new to Mahout and many of the machine learning ideas, but from what I
> understand of Naive Bayes classifier, it's possible to train a Naive Bayes
> model with continuous, categorical and word-like features from my
> understanding of the wikipedia entry
> http://en.wikipedia.org/wiki/Naive_Bayes_classifier
>
> The 20news and wikipedia examples currently in mahout from what I gather
> only use a target categorical variable and a text-like variables.
>
> I'm trying to replicate the person-gender-guesser used in the wikipedia
> article using mahout.
>
> Can anyone give me any tips about how to:
> * format input files (train and test) for different data types
> * inform the trainer and classifier which features are continuous,
>   categorical and word-like
>
> My dataset is quite small, so I'd like to be able to process this in code
> (using Vectors, Models, etc), but I'm quite confused about how to use the
> classifier.bayes packages to train/create model with all my feature types.
>
> Thanks in advance for any guidance.
>
> Cheers,
> --
> Vijay Santhanam
> Software Engineer
> http://au.linkedin.com/in/vijaysanthanam
> 0407525087
>
