Tharindu,
If I understand what you are trying to do:
a) You have a trained Bayes model.
b) You would like to classify new documents using this trained model.
c) You were trying to use TestNaiveBayesDriver to classify the documents in (b).
Option 1:
-----------
You could write a custom MapReduce job that creates sequence files from the
documents (without the label key). You could then feed these sequence files to
seq2sparse to generate your vectors, and call TestNaiveBayesDriver with that
input. Let me know if you need code for the first part.
Option 2:
-----------
Work with your existing tf-idf vectors generated from seqdirectory ->
seq2sparse. But instead of invoking Mahout's TestNaiveBayesDriver, create a
custom MapReduce job (or a plain Java program if that's fine with you) that
basically does the following:
a) Instantiate a classifier with the trained model (pseudocode below):

// Load the trained model and wrap it in a classifier
NaiveBayesModel naiveBayesModel =
    NaiveBayesModel.materialize(new Path(outputDir.getAbsolutePath()), conf);
AbstractVectorClassifier classifier = new StandardNaiveBayesClassifier(naiveBayesModel);

// Iterate over the input tf-idf vectors <Text, VectorWritable> and feed them to the classifier
for (Pair<Text,VectorWritable> record : new SequenceFileDirIterable<Text,VectorWritable>(
    getInputPath(), PathType.LIST, PathFilters.logsCRCFilter(), null, true, conf)) {
  // Invoke the classifier on the incoming vector; the result holds one score per label
  Vector result = classifier.classifyFull(record.getSecond().get());
  // Emit the document key together with its score vector
  context.write(record.getFirst(), new VectorWritable(result));
}
You can have the above code as part of a mapper in an MR job.
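To turn the score vector into a predicted category, you would typically take the index of the highest score and look it up in the label index. Here is a minimal plain-Java sketch of that last step, with no Mahout dependency. The class BestLabelPicker, the method bestLabel, and the sample labels and scores are all made up for illustration, and it assumes that a higher score means a better match. In Mahout itself I believe Vector.maxValueIndex() and BayesUtils.readLabelIndex() cover the same ground, but please check that against your version.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of Mahout): maps a per-label score array to a label name.
class BestLabelPicker {

    // scores[i] is the classifier's score for the label whose index is i.
    // Assumption: the highest score marks the best label.
    static String bestLabel(double[] scores, Map<Integer, String> labelIndex) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("No scores to choose from");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        String label = labelIndex.get(best);
        if (label == null) {
            throw new IllegalArgumentException("No label for index " + best);
        }
        return label;
    }

    public static void main(String[] args) {
        // Made-up label index and scores, standing in for the labelindex file
        // and the output of classifyFull().
        Map<Integer, String> labelIndex = new LinkedHashMap<>();
        labelIndex.put(0, "politics");
        labelIndex.put(1, "sports");
        labelIndex.put(2, "tech");

        double[] scores = {-120.5, -98.2, -143.0};
        System.out.println(bestLabel(scores, labelIndex)); // prints "sports"
    }
}
```

In a mapper you could write the document key with this label instead of (or alongside) the full score vector.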
On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart <[email protected]>
wrote:
To use Naive Bayes you need a SequenceFile <Text, VectorWritable> whose keys
are formatted like "label/label". I checked the sources to be sure, and for
some reason the code parses the key looking for a '/'.
When you used seqdirectory, it told Naive Bayes to classify the content of
each file (e.g. file1.txt) with the label corresponding to its name (here,
file1.txt). So when you tried testing with input0.txt, it failed because
input0.txt was not a valid label.
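To make the failure concrete, here is a small plain-Java sketch of the idea. It is not Mahout's code: the class KeyLabelDemo, the method extractLabel, the sample keys, and the rule "the label is the second piece after splitting on '/'" are my assumptions for illustration, as is the leading slash in the seqdirectory-style keys.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration (not Mahout source) of deriving a label from a key.
class KeyLabelDemo {

    // Assumption: the label is the second piece after splitting the key on '/'.
    static String extractLabel(String key) {
        String[] parts = key.split("/");
        if (parts.length < 2) {
            throw new IllegalArgumentException("Key has no '/': " + key);
        }
        return parts[1];
    }

    // True if the label derived from the key is one the model was trained on.
    static boolean isKnownLabel(String key, Set<String> trainedLabels) {
        return trainedLabels.contains(extractLabel(key));
    }

    public static void main(String[] args) {
        Set<String> trainedLabels = new HashSet<>(Arrays.asList("sports", "politics"));

        // Training-style key: the directory name becomes the label.
        System.out.println(extractLabel("/sports/doc1.txt"));               // prints "sports"
        System.out.println(isKnownLabel("/sports/doc1.txt", trainedLabels)); // prints "true"

        // New document placed directly in the input directory: the file name becomes the "label".
        System.out.println(extractLabel("/input0.txt"));                     // prints "input0.txt"
        System.out.println(isKnownLabel("/input0.txt", trainedLabels));      // prints "false"
    }
}
```

The last case is the situation behind "Label not found: input0.txt": the derived label is not in the label index.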
I wrote a MapReduce Java job that transforms a CSV with numeric values
into a proper SequenceFile. If you want, you can take the source and update
it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils
Good luck.
Kévin Moulart
2014-03-18 20:13 GMT+01:00 Frank Scholten <[email protected]>:
> Hi Tharindu,
>
> If I understand correctly seqdirectory creates labels based on the file
> name but this is not what you want. What do you want the labels to be?
>
> Cheers,
>
> Frank
>
>
> On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
> <[email protected]> wrote:
>
> > Hi everyone,
> > I'm developing an application where I need to train a Naive Bayes
> > classification model and use this model to classify new entities (in this
> > case, text files based on their content).
> >
> > I observed that the seqdirectory command always adds the file/directory
> > name as the "key" field for each document, which will be used as the label
> > in classification jobs.
> > This makes sense when I need to train a model and create the labelindex,
> > since I have organized my training data according to their labels in
> > separate directories.
> >
> > Now I'm trying to use this model and infer the best label for an unknown
> > document.
> > My requirement is to ask Mahout to read my new file and output the
> > predicted category by looking at the labelindex and the tfidf vector of
> > the new content.
> > I tried creating vectors from the new content (seqdirectory and
> > seq2sparse), and then using these vectors to run the testnb command. But
> > unfortunately the seqdirectory command adds file names as labels, which
> > does not make sense in classification.
> >
> > The following error message further demonstrates this behavior.
> > input0.txt is the file name of my new document.
> >
> > [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
> > java.lang.IllegalArgumentException: Label not found: input0.txt
> >     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
> >     at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
> >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
> >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
> >     at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
> >     at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
> >
> >
> > So how can I achieve what I'm trying to do here?
> >
> > Thanks,
> >
> >
> > --
> > M.P. Tharindu Rusira Kumara
> >
> > Department of Computer Science and Engineering,
> > University of Moratuwa,
> > Sri Lanka.
> > +94757033733
> > www.tharindu-rusira.blogspot.com
> >
>