Hi, first of all, I'm sorry that my previous mail was vague and poorly
formulated.
Yes, Suneel got exactly what I was asking. Both options will address my
requirement.
Thanks a lot.
-Tharindu
On Mar 19, 2014 8:51 AM, "Suneel Marthi" <suneel_mar...@yahoo.com> wrote:

> Tharindu,
>
> If I understand what you are trying to do:
>
> a) You have a trained Bayes model.
> b) You would like to classify new documents using this trained model.
> c) You were trying to use TestNaiveBayesDriver to classify the documents
> in (b).
>
> Option 1:
> -----------
>
> You could write a custom MapReduce job that creates sequence files from
> the documents (without the label key). You could feed these sequence files
> to seq2sparse to generate your vectors -> call TestNaiveBayes with this
> input. Let me know if you need code for the earlier part.
>
>
> Option 2:
> -----------
> Work with your existing tf-idf vectors generated from seqdirectory ->
> seq2sparse. But instead of invoking Mahout TestNaiveBayes, create a custom
> MapReduce job (or a plain Java program if that's fine with you) that
> basically does the following:
>
> a) Instantiate a classifier with trained model: (Pseudo code below)
>
>  // Load the trained model from the training output directory
>  NaiveBayesModel naiveBayesModel =
>      NaiveBayesModel.materialize(new Path(outputDir.getAbsolutePath()), conf);
>
>  AbstractVectorClassifier classifier =
>      new StandardNaiveBayesClassifier(naiveBayesModel);
>
>  // Parse through the input tf-idf vectors <Text, VectorWritable> and feed
>  // them to the classifier
>  for (Pair<Text,VectorWritable> vector :
>      new SequenceFileDirIterable<Text,VectorWritable>(getInputPath(),
>          PathType.LIST, PathFilters.logsCRCFilter(), null, true, conf)) {
>    // invoke the classifier on the incoming vector
>    Vector result = classifier.classifyFull(vector.getSecond().get());
>    // emit the per-label scores under the document's original key
>    context.write(vector.getFirst(), new VectorWritable(result));
>  }
>
> You can have the above code as part of a mapper in an MR job.
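
To turn the score vector returned by classifyFull into a single predicted
category, one option is to take the index with the highest score and look it
up in the label index. What follows is a minimal, plain-Java sketch of that
step, not Mahout code: the class BestLabel, its methods, the label names and
the scores are all made up for illustration. In Mahout itself the label index
can, as far as I know, be read with BayesUtils.readLabelIndex, and Vector
offers maxValueIndex(), but please check both against your version.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: picks the label whose score is highest.
class BestLabel {

    /** Returns the index of the largest score (the first one wins on ties). */
    public static int argMax(double[] scores) {
        if (scores.length == 0) {
            throw new IllegalArgumentException("no scores given");
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Maps the best-scoring index to its label via the label index. */
    public static String bestLabel(double[] scores, Map<Integer, String> labelIndex) {
        return labelIndex.get(argMax(scores));
    }

    public static void main(String[] args) {
        // Stand-in for the label index written during training (assumed labels).
        Map<Integer, String> labelIndex = new HashMap<>();
        labelIndex.put(0, "sports");
        labelIndex.put(1, "politics");
        labelIndex.put(2, "science");

        // Stand-in for one classifyFull result: one score per label.
        double[] scores = {-120.5, -98.2, -143.0};

        System.out.println(bestLabel(scores, labelIndex)); // prints "politics"
    }
}
```

In a mapper you would write the chosen label (instead of, or next to, the
full score vector) under the document's key.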
>
> On Tuesday, March 18, 2014 5:49 PM, Kevin Moulart <kevinmoul...@gmail.com>
> wrote:
>
> To use Naive Bayes you need a SequenceFile <Text, VectorWritable> with the
> key formatted, for some reason, like this: "label/label". I checked the
> sources to be sure, and it parses the key looking for a '/'.
>
> When you used seqdirectory, it told Naive Bayes to classify the content of
> each file (ex : file1.txt) with the label corresponding to its name (here,
> file1.txt). So when you tried testing with input0.txt it failed because
> input0.txt was not a valid label.
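
The failure described above can be reproduced in a few lines of plain Java.
This is only a sketch of the mechanism, not Mahout's actual code: the class
LabelFromKey, its methods and the set of known labels are hypothetical, and
the "/label/document" key layout is an assumption about what seqdirectory
writes for files sorted into per-label directories.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of deriving a label from a sequence-file key.
class LabelFromKey {

    /** Returns the first path segment of a key such as "/sports/doc1.txt". */
    public static String labelOf(String key) {
        String[] parts = key.split("/");
        // For a key starting with '/', parts[0] is the empty string.
        return parts.length > 1 ? parts[1] : key;
    }

    /** Fails the same way the test driver does when a label is unknown. */
    public static void requireKnown(String label, Set<String> knownLabels) {
        if (!knownLabels.contains(label)) {
            throw new IllegalArgumentException("Label not found: " + label);
        }
    }

    public static void main(String[] args) {
        // Labels seen during training (assumed for this example).
        Set<String> known = new HashSet<>(Arrays.asList("sports", "politics"));

        // Training-style key: the directory name is the label.
        System.out.println(labelOf("/sports/doc1.txt")); // prints "sports"

        // New document placed directly in the input directory:
        // the file name itself ends up as the "label".
        try {
            requireKnown(labelOf("/input0.txt"), known);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints "Label not found: input0.txt"
        }
    }
}
```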
>
> I designed a MapReduce Java job that transforms a CSV with numeric values
> into a proper SequenceFile. If you want, you can take the source and update
> it to suit your needs: https://github.com/kmoulart/hadoop_mahout_utils
>
> Good luck.
>
> Kévin Moulart
>
>
>
> 2014-03-18 20:13 GMT+01:00 Frank Scholten <fr...@frankscholten.nl>:
>
> > Hi Tharindu,
> >
> > If I understand correctly seqdirectory creates labels based on the file
> > name but this is not what you want. What do you want the labels to be?
> >
> > Cheers,
> >
> > Frank
> >
> >
> > On Tue, Mar 18, 2014 at 2:22 PM, Tharindu Rusira
> > <tharindurus...@gmail.com> wrote:
> >
> > > Hi everyone,
> > > I'm developing an application where I need to train a Naive Bayes
> > > classification model and use this model to classify new entities (in
> > > this case, text files based on their content).
> > >
> > > I observed that the seqdirectory command always adds the file/directory
> > > name as the "key" field for each document, which will be used as the
> > > label in classification jobs.
> > > This makes sense when I need to train a model and create the labelindex,
> > > since I have organized my training data according to their labels in
> > > separate directories.
> > >
> > > Now I'm trying to use this model to infer the best label for an
> > > unknown document.
> > > My requirement is to ask Mahout to read my new file and output the
> > > predicted category by looking at the labelindex and the tfidf vector of
> > > the new content.
> > > I tried creating vectors from the new content (seqdirectory and
> > > seq2sparse), and then using these vectors to run the testnb command. But
> > > unfortunately the seqdirectory command adds file names as labels, which
> > > does not make sense in classification.
> > >
> > > The following error message will further demonstrate this behavior.
> > > input0.txt is the file name of my new document.
> > >
> > > [main] ERROR com.me.classifier.mahout.MahoutClassifier - Error while classifying documents
> > > java.lang.IllegalArgumentException: Label not found: input0.txt
> > >     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:125)
> > >     at org.apache.mahout.classifier.ConfusionMatrix.getCount(ConfusionMatrix.java:182)
> > >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:205)
> > >     at org.apache.mahout.classifier.ConfusionMatrix.incrementCount(ConfusionMatrix.java:209)
> > >     at org.apache.mahout.classifier.ConfusionMatrix.addInstance(ConfusionMatrix.java:173)
> > >     at org.apache.mahout.classifier.ResultAnalyzer.addInstance(ResultAnalyzer.java:70)
> > >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.analyzeResults(TestNaiveBayesDriver.java:160)
> > >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.run(TestNaiveBayesDriver.java:125)
> > >     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > >     at org.apache.mahout.classifier.naivebayes.test.TestNaiveBayesDriver.main(TestNaiveBayesDriver.java:66)
> > >
> > >
> > > So how can I achieve what I'm trying to do here?
> > >
> > > Thanks,
> > >
> > >
> > > --
> > > M.P. Tharindu Rusira Kumara
> > >
> > > Department of Computer Science and Engineering,
> > > University of Moratuwa,
> > > Sri Lanka.
> > > +94757033733
> > > www.tharindu-rusira.blogspot.com
> > >
> >
