Tharindu, if I understand what you are trying to do:
a) You have a trained Bayes model.
b) You would like to classify new documents using this trained model.
c) You were trying to use TestNaiveBayesDriver to classify the documents in (b).

Option 1:
-----------
You could write a custom MapReduce job that creates sequence files from the documents (without the label key). You could feed these sequence files to seq2sparse to generate your vectors, then call TestNaiveBayes with this input. Let me know if you need code for the earlier part.

Option 2:
-----------
Work with your existing tf-idf vectors generated from seqdirectory -> seq2sparse. But instead of invoking Mahout TestNaiveBayes, create a custom MapReduce job (or a plain Java program if that's fine with you) that basically does the following:

a) Instantiate a classifier with the trained model (pseudo code below):

    NaiveBayesModel naiveBayesModel = NaiveBayesModel.materialize(new Path(outputDir.getAbsolutePath()), conf);
    AbstractVectorClassifier classifier = new StandardNaiveBayesClassifier(naiveBayesModel);

    // Parse through the input tf-idf vectors <Text, VectorWritable> and feed them to the classifier
    for (Pair<Text,VectorWritable> vector :
         new SequenceFileDirIterable<Text,VectorWritable>(getInputPath(), PathType.LIST,
             PathFilters.logsCRCFilter(), null, true, conf)) {
      // invoke the classifier on the incoming vector
      Vector result = classifier.classifyFull(vector.getSecond().get());
      // key: the document's original key; value: one score per label
      context.write(vector.getFirst(), new VectorWritable(result));
    }

You can have the above code as part of a mapper in an MR job.

On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart <kevinmoul...@gmail.com> wrote:

To use Naive Bayes you need a SequenceFile <Text, VectorWritable> with the key formatted like this: "label/label". For some reason (I checked the sources to be sure) it parses the key looking for a '/'. When you used seqdirectory, it told Naive Bayes to classify the content of each file (e.g. file1.txt) with the label corresponding to its name (here, file1.txt).
So when you tried testing with input0.txt it failed because input0.txt was not a valid label.

I designed a MapReduce Java job that transforms a CSV with numeric values into a proper SequenceFile. If you want, you can take the source and update it to suit your needs:
https://github.com/kmoulart/hadoop_mahout_utils

Good luck.

Kévin Moulart

2014-03-18 20:13 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:

> Hi Tharindu,
>
> If I understand correctly, seqdirectory creates labels based on the file
> name but this is not what you want. What do you want the labels to be?
>
> Cheers,
>
> Frank
>
>
> On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
> <tharindurus...@gmail.com> wrote:
>
> > Hi everyone,
> > I'm developing an application where I need to train a Naive Bayes
> > classification model and use this model to classify new entities (in this
> > case, text files based on their content).
> >
> > I observed that the seqdirectory command always adds the file/directory
> > name as the "key" field for each document, which will be used as the
> > label in classification jobs.
> > This makes sense when I need to train a model and create the labelindex,
> > since I have organized my training data according to their labels in
> > separate directories.
> >
> > Now I'm trying to use this model and infer the best label for an unknown
> > document.
> > My requirement is to ask Mahout to read my new file and output the
> > predicted category by looking at the labelindex and the tfidf vector of
> > the new content.
> > I tried creating vectors from the new content (seqdirectory and
> > seq2sparse), and then using this vector to run the testnb command. But
> > unfortunately the seqdirectory command adds file names as labels, which
> > does not make sense in classification.
> >
> > The following error message will further demonstrate this behavior.
> > input0.txt is the file name of my new document.
> >
> > [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
> > java.lang.IllegalArgumentException: Label not found: input0.txt
> >     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
> >     at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
> >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
> >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
> >     at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
> >     at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
> >
> >
> > So how can I achieve what I'm trying to do here?
> >
> > Thanks,
> >
> >
> > --
> > M.P. Tharindu Rusira Kumara
> >
> > Department of Computer Science and Engineering,
> > University of Moratuwa,
> > Sri Lanka.
> > +94757033733
> > www.tharindu-rusira.blogspot.com
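
A note on the last step of Option 2, since classifyFull only gives back a vector of scores and the original question asks for the predicted category. The sketch below is plain Java with no Mahout dependency, so it shows the idea rather than the actual Mahout code path. Assumptions: the scores have been copied into a double[] (one entry per label id), the highest score wins, and the labelindex has been loaded into a Map<Integer, String> (Mahout's BayesUtils.readLabelIndex should do that, but check it against your version). The class and method names (PredictedLabel, bestLabel, labelFromKey) are made up for illustration. labelFromKey assumes keys of the form "/label/docname" with the label taken as the first segment after the leading slash, which would explain why a seqdirectory key like "/input0.txt" ends up treated as a label.

```java
import java.util.HashMap;
import java.util.Map;

public class PredictedLabel {

    // Assumption: keys look like "/label/docname"; the label is the first
    // path segment after the leading slash.
    public static String labelFromKey(String key) {
        String[] parts = key.split("/");
        if (parts.length < 2) {
            throw new IllegalArgumentException("No label segment in key: " + key);
        }
        return parts[1];
    }

    // Returns the label whose score is highest.
    // Assumption: scores[i] is the score for label id i, higher is better.
    public static String bestLabel(double[] scores, Map<Integer, String> labelIndex) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("Empty score vector");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        String label = labelIndex.get(best);
        if (label == null) {
            throw new IllegalArgumentException("No label for index " + best);
        }
        return label;
    }

    public static void main(String[] args) {
        // Hypothetical label index and scores, for illustration only.
        Map<Integer, String> labelIndex = new HashMap<>();
        labelIndex.put(0, "politics");
        labelIndex.put(1, "sports");
        double[] scores = {-120.5, -98.2};

        System.out.println(bestLabel(scores, labelIndex));     // prints "sports"
        System.out.println(labelFromKey("/sports/file1.txt")); // prints "sports"
        System.out.println(labelFromKey("/input0.txt"));       // prints "input0.txt"
    }
}
```

In a mapper like the one in Option 2, you would fill the array from the result vector and write the label returned by bestLabel as the output value instead of the raw scores.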