Excuse me, there was a typo in my previous post: the train and test calls were reversed.
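For reference, here is a sketch of what the corrected calls would look like. It assumes only the two subcommand names were swapped, and that every flag and path from the quoted message stays as posted. The `run` helper and the fallback path are placeholders of mine for a dry run, not part of Mahout.

```shell
# Dry-run sketch: prints the corrected commands instead of executing them.
# Assumption: only trainclassifier/testclassifier were swapped in the quoted
# post; all flags and paths are copied from it unchanged.
MAHOUT="${MAHOUT_HOME:-/path/to/mahout}/bin/mahout"   # placeholder fallback path

# Hypothetical helper: echoes the command line; change the body to "$@" to run it.
run() { echo "$@"; }

# 1. Train: build the model from the prepared input
run "$MAHOUT" trainclassifier -i 20news-input -o newsmodel \
    -type bayes -ng 1 -source hdfs

# 2. Test: evaluate the trained model
run "$MAHOUT" testclassifier -m newsmodel -d 20news-input \
    -type bayes -ng 1 -source hdfs -method sequential
```

With the helper left as-is, this only prints the two command lines, so they can be checked before anything touches HDFS.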

On Mon, Dec 6, 2010 at 6:12 AM, Frank Wang <[email protected]> wrote:

> I'm seeing this problem on Ubuntu as well.
>
> *Issue 1:*
> Test result is all 0's.
> http://pastebin.com/CicVMpST
>
> The steps are:
> 1. Train:
> $MAHOUT_HOME/bin/mahout testclassifier -m newsmodel -d 20news-input
> -type bayes -ng 1 -source hdfs -method sequential
>
> 2. Test
> $MAHOUT_HOME/bin/mahout trainclassifier -i 20news-input -o newsmodel
> -type bayes -ng 1 -source hdfs
>
> The output are all 0's.
>
> *Issue 2:*
> Also, when I do
> "bin/mahout trainclassifier
> -i examples/bin/work/20news-bydate/bayes-train-input
> -o examples/bin/work/20news-bydate/bayes-model"
>
> I get the error
> "Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException:
> Input path does not exist:
> hdfs://localhost:9000/user/root/examples/bin/work/20news-bydate/bayes-train-input"
>
> I digged into the code, it seems that trainclassifier only accepts HDFS or
> HBASE, is there a way to read file directly from a directory?
>
>
> On Tue, Nov 23, 2010 at 2:05 AM, Divya <[email protected]> wrote:
>
>> Hi,
>>
>> I am able to get the results when I run the test classifier.
>> Can view my results @ http://pastebin.com/D5ejTwEW
>>
>> Steps I followed
>> 1)generate input data set with to train the classifier
>> $ bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p examples/bin/work/20news-bydate/20news-bydate-train -o examples/bin/work/20news-bydate/bayes-train-input -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
>> 2)Generate train input data set to test the classifier
>> bin/mahout org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups -p examples/bin/work/20news-bydate/20news-bydate-test -o examples/bin/work/20news-bydate/bayes-test-input -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8
>> 3)Train the classifier
>> bin/mahout trainclassifier -i examples/bin/work/20news-bydate/bayes-train-input -o examples/bin/work/20news-bydate/bayes-model
>> 4)Test the classifier
>> bin/mahout testclassifier -m examples/bin/work/20news-bydate/bayes-model -d examples/bin/work/20news-bydate/bayes-test-input
>>
>> I have not passed any parameters except the required ones.
>>
>> But when I pass the other parameters like -type bayes -ng 3 -source hdfs
>>
>> I am not getting the expected results.
>> Can any one please explain me the reason behind it.
>>
>> Thanks
>> Regards,
>> Divya
>>
>>
>> -----Original Message-----
>> From: Divya [mailto:[email protected]]
>> Sent: Tuesday, November 23, 2010 1:40 PM
>> To: '[email protected]'
>> Subject: RE: classification example doubts
>>
>> I am following same steps
>> But no success...
>>
>> -----Original Message-----
>> From: Sreejith S [mailto:[email protected]]
>> Sent: Friday, November 19, 2010 4:00 PM
>> To: [email protected]
>> Subject: Re: classification example doubts
>>
>> step 1 : U can provide ur own sample data set using the prepare20news example
>> just provide ur input dir.This is to perform some normalization on each
>> file.This is a must
>>
>> stpe2 : Train the classifier with the normalized list of files.
>> u get a model dir which contains the trained data set in hdfs.
>>
>> step3 : Test the classifier
>> By using the trained model and sample input u can test the classifier
>>
>> Regards
>> Sreejith
>>
>>
>> On Fri, Nov 19, 2010 at 1:15 PM, Divya <[email protected]> wrote:
>>
>> > for my first question u say we can put our own input documents in directory
>> > that documents also should be of format similar to bayes-train-input.
>> > If yes, then I generated my input data using PrepareTwentyNewsgroups.
>> > And used that as my input for testclassifier
>> > But didn't get expected results.
>> > As I observed it didn't read my files I my input directory
>> > I tried replacing one of the files of input directory with one of the files
>> > of train-input directory
>> > Still same result.
>> > Why is it not reading my files?
>> >
>> > Results below :
>> >
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.sys.mac.hardware -121323.6282757108 547567.2698760114 -0.22156844455510052
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.space -189203.04544769705 547567.2698760114 -0.3455338838834164
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.motorcycles -138625.2628242977 547567.2698760114 -0.25316572127418674
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: rec.autos -136935.18434679657 547567.2698760114 -0.25007919917821886
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: comp.graphics -161979.38306986375 547567.2698760114 -0.29581640828631267
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: talk.politics.misc -159579.70032298338 547567.2698760114 -0.29143396455949216
>> > 10/11/19 10:45:12 INFO datastore.InMemoryBayesDatastore: sci.med -183835.5334355675 547567.2698760114 -0.3357314133790253
>> > 10/11/19 10:45:12 INFO bayes.TestClassifier:
>> > =======================================================
>> > Summary
>> > -------------------------------------------------------
>> > Correctly Classified Instances : 0 ?%
>> > Incorrectly Classified Instances : 0 ?%
>> > Total Classified Instances : 0
>> >
>> > =======================================================
>> > Confusion Matrix
>> > -------------------------------------------------------
>> > a b c d e f g h i j k l m n o p q r s t <--Classified as
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 a = rec.sport.baseball
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 b = sci.crypt
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 c = rec.sport.hockey
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 d = talk.politics.guns
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 e = soc.religion.christian
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 f = sci.electronics
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 g = comp.os.ms-windows.misc
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 h = misc.forsale
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 i = talk.religion.misc
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 j = alt.atheism
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 k = comp.windows.x
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 l = talk.politics.mideast
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 m = comp.sys.ibm.pc.hardware
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 n = comp.sys.mac.hardware
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 o = sci.space
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 p = rec.motorcycles
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 q = rec.autos
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 r = comp.graphics
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 s = talk.politics.misc
>> > 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 | 0 t = sci.med
>> > Default Category: unknown: 20
>> >
>> >
>> > 10/11/19 10:45:12 INFO driver.MahoutDriver: Program took 5485 ms
>> >
>> > Am I missing anything .
>> >
>> >
>> > Come to my second question, that means we are testing the classifier against
>> > our inputs itself.
>> > Still I didn't understand.
>> > What I understood about classification is we have set of documents which
>> > will act as model for classification of new documents in the system.
>> > Am I right?
>> > Doesn't Mahout works in same way ?
>> >
>> > Third question, yeah I am looking for Mahout's API for classification.
>> >
>> >
>> > @ Jaganadh - Thanks for clearing my doubts
>> >
>> > Regards,
>> > Divya
>> >
>> >
>> > -----Original Message-----
>> > From: JAGANADH G [mailto:[email protected]]
>> > Sent: Friday, November 19, 2010 3:09 PM
>> > To: [email protected]
>> > Subject: Re: classification example doubts
>> >
>> > >
>> > > 1) I want to know what should go in "bayes-test-input".
>> > >
>> > >
>> > After preparing the 20news-group data for training you can separate some
>> > documents for testing your classifier.
>> > These documents should go to "bayes-test-input".
>> >
>> > Or ven you can put a new set of documets in the directory .
>> >
>> >
>> > > 2) If we take Wikipedia example
>> > > https://cwiki.apache.org/MAHOUT/wikipedia-bayes-example.html
>> > >
>> > >
>> > >
>> > > To trainclassifier We have used Wikipediainput to generate model .
>> > >
>> > > To test classifier again we used wikipediamodel as input and Wikipedia input
>> > > as test documents directory.
>> > >
>> > > I didn't understand why are we doing so ?
>> > >
>> > >
>> >
>> > We are testing the classifier against the development set we used.
>> >
>> >
>> >
>> > > 3) Last thing I want to know that when we use run testclassifier using
>> > > command line we can see the output.
>> > >
>> > > How can we make use of this output?
>> > >
>> >
>> >
>> > Are you looking for Mahout API usgae for classification ?
>> >
>> > --
>> > **********************************
>> > JAGANADH G
>> > http://jaganadhg.freeflux.net/blog
>> >
>> >
>>
>>
>
