Tharindu,

If I understand what you are trying to do:

a) You have a trained Bayes model.
b) You would like to classify new documents using this trained model.
c) You were trying to use TestNaiveBayesDriver to classify the documents in (b).

Option 1:
-----------

You could write a custom MapReduce job that creates sequence files from the 
documents (without the label key). You could then feed these sequence files to 
seq2sparse to generate your vectors, and call TestNaiveBayes with this input. 
Let me know if you need code for the earlier part.


Option 2:
-----------
Work with your existing tf-idf vectors generated from seqdirectory -> 
seq2sparse. But instead of invoking Mahout's TestNaiveBayes, create a custom 
MapReduce job (or a plain Java program if that's fine with you) that basically 
does the following:

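a) Instantiate a classifier with the trained model (pseudocode below):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.classifier.AbstractVectorClassifier;
    import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
    import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
    import org.apache.mahout.common.Pair;
    import org.apache.mahout.common.iterator.sequencefile.PathFilters;
    import org.apache.mahout.common.iterator.sequencefile.PathType;
    import org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterable;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    // Load the trained model from the directory the training job wrote to.
    NaiveBayesModel naiveBayesModel =
        NaiveBayesModel.materialize(new Path(outputDir.getAbsolutePath()), conf);
    AbstractVectorClassifier classifier =
        new StandardNaiveBayesClassifier(naiveBayesModel);

    // Parse through the input tf-idf vectors <Text, VectorWritable> and feed
    // them to the classifier.
    for (Pair<Text, VectorWritable> record :
            new SequenceFileDirIterable<Text, VectorWritable>(getInputPath(),
                PathType.LIST, PathFilters.logsCRCFilter(), null, true, conf)) {
        // Invoke the classifier on the incoming vector; the result holds one
        // score per label.
        Vector result = classifier.classifyFull(record.getSecond().get());
        // Emit the original key together with the per-label scores.
        context.write(record.getFirst(), new VectorWritable(result));
    }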

You can have the above code as part of a mapper in an MR job.
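
To get from the score vector to a predicted category, pick the index with the highest score and look it up in the label index. Below is a minimal sketch in plain Java, not Mahout code: the class and method names are hypothetical, the double[] stands in for the Vector returned by classifyFull, and the Map stands in for the label index written at training time. It assumes scores are ordered by label id and that a higher score means a better match. In Mahout itself, Vector.maxValueIndex() and BayesUtils.readLabelIndex should cover the same steps, but please check them against your version.

```java
import java.util.Map;

// Hypothetical helper (not part of Mahout): maps a per-label score array
// to the label with the highest score.
final class BestLabel {
    private BestLabel() {}

    // Returns the index of the largest score; ties go to the lowest index.
    static int argMax(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("No scores to compare");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    // Looks up the winning index in the label index (label id -> label name).
    static String predict(double[] scores, Map<Integer, String> labelIndex) {
        int best = argMax(scores);
        String label = labelIndex.get(best);
        if (label == null) {
            throw new IllegalArgumentException("No label for index " + best);
        }
        return label;
    }
}
```

For example, with a label index of {0: "politics", 1: "sports"} and scores {-3.0, -7.5}, predict returns "politics".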









On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart <kevinmoul...@gmail.com> 
wrote:
 
To use Naive Bayes you need a SequenceFile <Text, VectorWritable> whose key
is formatted like "label/label". I checked the sources to be sure: for some
reason it parses the key by looking for a '/'.

When you used seqdirectory, it told Naive Bayes to classify the content of
each file (e.g. file1.txt) with the label corresponding to its name (here,
file1.txt). So when you tried testing with input0.txt, it failed because
input0.txt was not a valid label.
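
As a rough illustration of that parsing, here is a small plain-Java sketch. It assumes the key has the form "/<label>/<file name>", which is what seqdirectory should write when the documents sit in one directory per label; the class and method names are hypothetical and this is not the Mahout source. It shows why a file placed directly in the input directory ends up with its own name as the label.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Mahout): derives the label from a
// sequence-file key by splitting on '/'.
final class KeyLabels {
    private static final Pattern SLASH = Pattern.compile("/");

    private KeyLabels() {}

    // For "/sports/doc1.txt" the parts are ["", "sports", "doc1.txt"],
    // so the label is the element at index 1.
    static String labelOf(String key) {
        String[] parts = SLASH.split(key);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Key has no '/'-separated label: " + key);
        }
        return parts[1];
    }

    // True only if the parsed label is one the model was trained on.
    static boolean isKnown(String key, Set<String> trainedLabels) {
        return trainedLabels.contains(labelOf(key));
    }
}
```

With trained labels {"sports", "politics"}, the key "/sports/doc1.txt" yields the known label "sports", while "/input0.txt" yields "input0.txt", which is not a trained label.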

I designed a MapReduce Java job that transforms a CSV with numeric values
into a proper SequenceFile. If you want, you can take the source and update
it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils

Good luck.

Kévin Moulart



2014-03-18 20:13 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:

> Hi Tharindu,
>
> If I understand correctly seqdirectory creates labels based on the file
> name but this is not what you want. What do you want the labels to be?
>
> Cheers,
>
> Frank
>
>
> On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
> <tharindurus...@gmail.com> wrote:
>
> > Hi everyone,
> > I'm developing an application where I need to train a Naive Bayes
> > classification model and use this model to classify new entities (in this
> > case, text files based on their content).
> >
> > I observed that the seqdirectory command always adds the file/directory
> > name as the "key" field for each document, which will be used as the
> > label in classification jobs.
> > This makes sense when I need to train a model and create the labelindex,
> > since I have organized my training data according to their labels in
> > separate directories.
> >
> > Now I'm trying to use this model and infer the best label for an unknown
> > document.
> > My requirement is to ask Mahout to read my new file and output the
> > predicted category by looking at the labelindex and the tfidf vector of
> > the new content.
> > I tried creating vectors from the new content (seqdirectory and
> > seq2sparse), and then using these vectors to run the testnb command. But
> > unfortunately the seqdirectory command adds file names as labels, which
> > does not make sense in classification.
> >
> > The following error message further demonstrates this behavior.
> > input0.txt is the file name of my new document.
> >
> > [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
> > java.lang.IllegalArgumentException: Label not found: input0.txt
> >     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
> >     at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
> >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
> >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
> >     at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
> >     at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
> >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
> >
> >
> > So how can I achieve what I'm trying to do here?
> >
> > Thanks,
> >
> >
> > --
> > M.P. Tharindu Rusira Kumara
> >
> > Department of Computer Science and Engineering,
> > University of Moratuwa,
> > Sri Lanka.
> > +94757033733
> > www.tharindu-rusira.blogspot.com
> >
>
