Hi,

Yes, I understand the logic behind it. First we provide a set of documents and train the classifier, which builds a model from them. Then, when testing the classifier, we provide input data that has been prepared in the same dataset format, and the classifier labels that data according to the built model.

I am doing the same thing, using the test input that comes with the 20news-bydate.tar.gz data set. Extracting 20news-bydate.tar.gz gives two directories, 20news-bydate-train and 20news-bydate-test; I use the first to train the classifier and the second to test it.

Steps I am following:

1. Extract the dataset

   $ tar zxf 20news-bydate.tar.gz

2. Generate the input dataset to train the classifier

   $ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
       -p examples/bin/work/20news-bydate/20news-bydate-train \
       -o examples/bin/work/20news-bydate/bayes-train-input \
       -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

3. Generate the input dataset to test the classifier

   $ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
       -p examples/bin/work/20news-bydate/20news-bydate-test \
       -o examples/bin/work/20news-bydate/20news-test-input \
       -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8

4. Train the classifier

   $ bin/mahout trainclassifier \
       -i examples/bin/work/20news-bydate/bayes-train-input \
       -o examples/bin/work/20news-bydate/bayes-model \
       -type bayes -ng 1 -source hdfs

5. Test the classifier

   $ bin/mahout testclassifier \
       -m D:/mahout-0.4/examples/bin/work/20news-bydate/bayes-model \
       -d D:/mahout-0.4/examples/bin/work/20news-test-input \
       -type bayes -ng 1 -method sequential

I am not getting the expected output. You can view my result at http://pastebin.com/CicVMpST.
I am still trying to figure out what is missing in my steps. Can anyone help me?

Regards,
Divya

-----Original Message-----
From: JAGANADH G [mailto:[email protected]]
Sent: Friday, November 19, 2010 5:36 PM
To: Divya
Cc: [email protected]
Subject: Re: classification example doubts

On Fri, Nov 19, 2010 at 1:15 PM, Divya <[email protected]> wrote:

> For my first question, you say we can put our own input documents in the directory;
> those documents should also be in a format similar to bayes-train-input.
> If yes, then I generated my input data using PrepareTwentyNewsgroups.
> And used that as my input for testclassifier,
> but didn't get the expected results.
> As I observed, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the files
> from the train-input directory.
> Still the same result.
> Why is it not reading my files?
>
> Am I missing anything?
>

I think something went wrong with your training. I trained on the 20 newsgroups data and tested it. My result is available at http://pastebin.com/kGY4LmW7. Check it.

The commands I used:

1) To prepare the data:

   bin/mahout prepare20newsgroups -p /home/jaganadhg/20news-bydate-train/ -o 20news -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer

2) To train:

   bin/mahout trainclassifier -i 20news/ -o 20cbayesn -type cbayes -a 1.0 -ng 2

3) To test:

   bin/mahout testclassifier -m 20bayes -d 20news -type bayes -ng 2 -method sequential

>
> Coming to my second question: that means we are testing the classifier
> against our inputs itself.
> Still I didn't understand.
> What I understood about classification is that we have a set of documents which
> will act as the model for classification of new documents in the system.
> Am I right?
>

The documents do not act as the model. Mahout's TrainClassifier creates a model from the documents provided for training. The testclassifier command takes the following arguments:

1) A directory containing the model (specified after -m).
2) A directory containing the documents for testing the classifier (specified after -d). Documents in this directory should be formatted the same way we prepared the documents for training.
3) The type of classifier algorithm (specified after -type). Here I used bayes.
4) The default category name (specified after -default). You can set it to "unknown".
5) The value of alpha_i used in training (specified after -a). By default it is 1.0.
6) The source of the model directory (specified after -source). You can set it to hdfs.
7) The n-gram size (specified after -ng).
The n-gram size should be the same as the one you used in training.

A sample command with all these parameters is shown below:

   bin/mahout testclassifier -d movie -m movie-model/ -type bayes -default unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1

> Doesn't Mahout work in the same way?
>
> Third question: yes, I am looking for Mahout's API for classification.
>

A sample program is given below:

http://bitbucket.org/jaganadhg/blog/src/995fa52d4fbc/bck9/java/src/org/bc/kl/ClassifierDemo.java

To make it work in a real-time system you have to do some more work. Find it :-)

--
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
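
The prepare / train / test sequence discussed in this thread can be sketched as a small dry-run script. This is only an illustration: it prints the commands rather than running Mahout, it uses only flags quoted in the messages above, and the variable names (WORK, TYPE, NGRAM) and helper function names are assumptions made for this sketch, not Mahout settings. The point is that the classifier type and n-gram size are defined once and reused, so training and testing cannot drift apart, and the test step reads from the same directory the prepare step wrote to.

```shell
#!/bin/sh
# Dry-run sketch: prints the Mahout commands instead of executing them.
# WORK, TYPE and NGRAM are illustrative variables, not Mahout settings.
WORK="examples/bin/work/20news-bydate"
TYPE="bayes"   # classifier type, shared by training and testing
NGRAM=1        # n-gram size, shared by training and testing

# Print the command that converts a raw 20news directory ($1)
# into classifier input written to directory ($2).
prepare_cmd() {
    echo "bin/mahout prepare20newsgroups -p $1 -o $2 -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8"
}

# Print the training command: prepared training input in, model out.
train_cmd() {
    echo "bin/mahout trainclassifier -i $WORK/bayes-train-input -o $WORK/bayes-model -type $TYPE -ng $NGRAM -source hdfs"
}

# Print the testing command: same model, same type, same n-gram size.
test_cmd() {
    echo "bin/mahout testclassifier -m $WORK/bayes-model -d $WORK/20news-test-input -type $TYPE -ng $NGRAM -method sequential"
}

prepare_cmd "$WORK/20news-bydate-train" "$WORK/bayes-train-input"
prepare_cmd "$WORK/20news-bydate-test" "$WORK/20news-test-input"
train_cmd
test_cmd
```

Once the printed commands look right, they can be pasted into a shell in the Mahout directory, or the echo calls can be replaced with the commands themselves.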
