By non-text, do you mean continuous values or sparse sets of tokens? The general idea is that Naive Bayes expects its input to be sparse sets of tokens.
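To make the distinction concrete, here is a minimal sketch of one way continuous values could be turned into a set of tokens by binning. It is plain Python and not part of Mahout; the function name `to_tokens`, the `bin_width` parameter, and the `name=bucket` token format are assumptions made up for illustration, and the attribute names are simply taken to match the iris data.

```python
def to_tokens(row, names, bin_width=1.0):
    """Map each continuous feature to a discrete token like 'sepallength=5'.

    Hypothetical helper for illustration only; not a Mahout API.
    """
    tokens = []
    for name, value in zip(names, row):
        # Floor the value into a bucket of width bin_width.
        bucket = int(value // bin_width)
        tokens.append(f"{name}={bucket}")
    return tokens


# Assumed attribute names for the iris data set.
names = ["sepallength", "sepalwidth", "petallength", "petalwidth"]

if __name__ == "__main__":
    print(to_tokens([5.1, 3.5, 1.4, 0.2], names))
```

Each row then becomes a small bag of tokens, which is closer to the kind of input a token-based Naive Bayes expects. The bin width controls how much of the original resolution is kept and would need tuning per feature.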

On Wed, Aug 7, 2013 at 2:00 PM, John Meagher <[email protected]> wrote:

> I'm just starting work with Mahout and I'm struggling getting an
> example of a non-text based Naive Bayes classifier up and running.
> The input will be feature vectors generated outside of Mahout. As a
> test I'm using arff files (anything else CSV-ish will work). I've
> been able to convert things into vectors in a few different ways, but
> can't figure out what is needed to get the trainnb command to work.
>
> Does the label index need to be generated through some manual process
> or something other than the arff.vector or trainnb command?
>
> Is there a specific format needed for the input arff files? Specific
> columns in a specific order?
>
>
> Here's what I've tried so far in both 0.7 from CDH4 and 0.8 direct from
> Apache:
>
> $ wget http://repository.seasr.org/Datasets/UCI/arff/iris.arff
> $ mahout arff.vector --input iris.arff --output iris.model --dictOut iris.labels
>
> This works and seems to be right so far
>
> This is the command I think I need to train the Naive Bayes model. It
> fails when creating the label index with the exception below.
>
> $ mahout trainnb -i iris.model/ -o iris.training -el -li iris.training.labels
> ...
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 1
>     at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:123)
>     at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:180)
>     at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:94)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> ...
>
>
> Thanks for the help,
> John
