On Fri, Nov 19, 2010 at 1:15 PM, Divya <[email protected]> wrote:
> For my first question, you say we can put our own input documents in the
> directory, and that those documents should also be in a format similar to
> bayes-train-input.
> If yes, then I generated my input data using PrepareTwentyNewsgroups
> and used that as my input for testclassifier,
> but didn't get the expected results.
> As I observed, it didn't read the files in my input directory.
> I tried replacing one of the files in the input directory with one of the
> files from the train-input directory.
> Still the same result.
> Why is it not reading my files?
>
> Am I missing anything?
>

I think something went wrong with your training. I trained on 20 Newsgroups
and tested it. My result is available at http://pastebin.com/kGY4LmW7 . Check it.
The commands I used:

1) To prepare the data:

   bin/mahout prepare20newsgroups -p /home/jaganadhg/20news-bydate-train/ -o 20news -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer

2) To train:

   bin/mahout trainclassifier -i 20news/ -o 20cbayesn -type cbayes -a 1.0 -ng 2

3) To test:

   bin/mahout testclassifier -m 20bayes -d 20news -type bayes -ng 2 -method sequential

>
> Coming to my second question: that means we are testing the classifier
> against our inputs itself.
> I still didn't understand.
> What I understood about classification is that we have a set of documents
> which will act as the model for classifying new documents in the system.
> Am I right?
>

The documents do not act as the model. Mahout's TrainClassifier creates a
model out of the documents provided for training. The testclassifier command
takes the following arguments:

1) A directory containing the model (specified after -m).
2) A directory containing the documents for testing the classifier
   (specified after -d). Documents in this directory should be formatted
   the same way we prepared the documents for training.
3) The type of classifier algorithm (specified after -type). Here I used
   bayes.
4) Default category name (specified after -default). You can set it to
   "unknown".
5) Value of Alpha_i used in training (specified after -a). By default it
   is 1.0.
6) Source of the model directory (specified after -source). You can set it
   to hdfs.
7) N-gram size (specified after -ng). The n-gram size should be the same
   as the one you used in training.

A sample command with all these parameters is shown below:

   bin/mahout testclassifier -d movie -m movie-model/ -type bayes -default unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1

> Doesn't Mahout work in the same way?
>
> Third question: yes, I am looking for Mahout's API for classification.
>

A sample program is given below:

http://bitbucket.org/jaganadhg/blog/src/995fa52d4fbc/bck9/java/src/org/bc/kl/ClassifierDemo.java

To make it work in a real-time system you have to do some more work. Find it :-)

-- 
**********************************
JAGANADH G
http://jaganadhg.freeflux.net/blog
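
P.S. As far as I understand, the model directory, the -type and the -ng value given to testclassifier have to be the ones used with trainclassifier, so it may help to keep them in one place. Below is a minimal dry-run sketch that only prints the two commands and does not run Mahout. The variable names and the model directory name (20news-model) are my own assumptions for illustration; the flags are the ones from the commands above.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands only, it does not invoke Mahout.
# Variable names and the MODEL_DIR value are illustrative assumptions.
INPUT_DIR="20news"        # directory written by prepare20newsgroups
MODEL_DIR="20news-model"  # hypothetical model directory name
TYPE="bayes"              # bayes or cbayes; shared by train and test
NGRAM=2                   # n-gram size; shared by train and test
ALPHA="1.0"               # smoothing value passed with -a

# Print the training command built from the shared settings.
train_cmd() {
    echo "bin/mahout trainclassifier -i $INPUT_DIR -o $MODEL_DIR -type $TYPE -a $ALPHA -ng $NGRAM"
}

# Print the testing command; it reuses MODEL_DIR, TYPE and NGRAM.
test_cmd() {
    echo "bin/mahout testclassifier -m $MODEL_DIR -d $INPUT_DIR -type $TYPE -ng $NGRAM -method sequential"
}

train_cmd
test_cmd
```

Once the printed commands look right, they can be copied and run by hand from the Mahout directory.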
