Hi guys, I was able to test this example. But how do I use the actual classifier? Once I have trained on the data and have the model, I want to use the model to categorize a new set of data that has not been classified.
Is there any straightforward way to do this with Mahout, or should I be tweaking the code?

Regards,
~Vivek

On Fri, Nov 19, 2010 at 4:35 AM, JAGANADH G <[email protected]> wrote:

> On Fri, Nov 19, 2010 at 1:15 PM, Divya <[email protected]> wrote:
>
> > For my first question, you say we can put our own input documents in the
> > directory. Should those documents also be in a format similar to
> > bayes-train-input?
> > If yes, then I generated my input data using PrepareTwentyNewsgroups
> > and used that as my input for testclassifier,
> > but didn't get the expected results.
> > As far as I can see, it didn't read the files in my input directory.
> > I tried replacing one of the files in the input directory with one of the
> > files from the train-input directory.
> > Still the same result.
> > Why is it not reading my files?
> >
> > Am I missing anything?
>
> I think something went wrong with your training.
> I trained on 20 newsgroups and tested it. My result is available at
> http://pastebin.com/kGY4LmW7 . Check it.
>
> The commands I used:
> 1) To prepare the data:
> bin/mahout prepare20newsgroups -p /home/jaganadhg/20news-bydate-train/ -o 20news -c UTF-8 -a org.apache.mahout.vectorizer.DefaultAnalyzer
> 2) To train:
> bin/mahout trainclassifier -i 20news/ -o 20cbayesn -type cbayes -a 1.0 -ng 2
> 3) To test:
> bin/mahout testclassifier -m 20bayes -d 20news -type bayes -ng 2 -method sequential
>
> The result is available at http://pastebin.com/kGY4LmW7
>
> > Coming to my second question, that means we are testing the classifier
> > against our own inputs.
> > I still don't understand.
> > What I understood about classification is that we have a set of documents
> > which will act as the model for classifying new documents in the system.
> > Am I right?
>
> The documents do not act as the model. Mahout's TrainClassifier creates a
> model out of the documents provided for training.
> The testclassifier command takes the following arguments:
> 1) A directory containing the model (specified after -m).
> 2) A directory containing the documents for testing the classifier
>    (specified after -d). Documents in this directory should be formatted
>    the same way we prepared the documents for training.
> 3) The type of the classifier algorithm (specified after -type). Here I used bayes.
> 4) The default category name (specified after -default). You can set it to "unknown".
> 5) The value of Alpha_i used in training (specified after -a). By default it is 1.0.
> 6) The source of the model directory (specified after -source). You can set it to hdfs.
> 7) The n-gram size (specified after -ng). It should be the same as the one you
>    used in training.
>
> A sample command with all these parameters is shown below:
> bin/mahout testclassifier -d movie -m movie-model/ -type bayes -default unknown -a 1.0 -method sequential -source hdfs -e UTF-8 -ng 1
>
> > Doesn't Mahout work in the same way?
> >
> > Third question: yes, I am looking for Mahout's API for classification.
>
> A sample program is given below:
>
> http://bitbucket.org/jaganadhg/blog/src/995fa52d4fbc/bck9/java/src/org/bc/kl/ClassifierDemo.java
>
> To make it work in a real-time system you will have to do some more work. Find it :-)
>
> --
> **********************************
> JAGANADH G
> http://jaganadhg.freeflux.net/blog
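
The testclassifier parameters listed in the quoted reply can be collected into a small shell helper. This is only a sketch: the function name build_testclassifier_cmd and its argument order are made up for illustration, the flags are the ones that appear in the sample command in the quoted message, and the script only prints the command instead of running Mahout.

```shell
#!/bin/sh
# Hypothetical helper (not part of Mahout): print a testclassifier
# command line built from the parameters described in the quoted reply.
build_testclassifier_cmd() {
  model_dir=$1   # -m: directory containing the trained model
  test_dir=$2    # -d: test documents, prepared like the training data
  algo=$3        # -type: classifier algorithm, e.g. bayes
  ngram=$4       # -ng: n-gram size, same value as used in training

  # The remaining flags are copied from the sample command in the thread.
  printf 'bin/mahout testclassifier -m %s -d %s -type %s -ng %s -default unknown -a 1.0 -method sequential -source hdfs -e UTF-8\n' \
    "$model_dir" "$test_dir" "$algo" "$ngram"
}

# Example: print the command for the movie data set from the thread.
build_testclassifier_cmd movie-model/ movie bayes 1
```

Printing the command first makes it easy to check that the -ng value matches the one passed to trainclassifier before running anything.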
