You will need to write custom code to convert the text from the file into vectors, and then use those vectors to query the pre-built model.
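A note on the exception first: the stack trace shows TestNewsGroups reading the model through org.apache.mahout.classifier.sgd.ModelSerializer, which is the SGD model reader, while the model file was written by trainnb. The two formats most likely do not match, which would explain the malformed-input error. The model should instead be loaded with the naive Bayes classes (the org.apache.mahout.classifier.naivebayes package), and the new file vectorized against the dictionary and document-frequency data that seq2sparse wrote under reuters-out-seqdir-sparse.

To show what "convert text into vectors and score them" means, here is a rough sketch in plain Java. It does not use any Mahout classes. The class name, the toy dictionary, the document frequencies, the per-label weights, and the TF-IDF variant are all made up for illustration and are not necessarily what Mahout computes; in real code these values would come from the seq2sparse output and the trained model.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: plain Java, no Mahout classes; all data below is hypothetical. */
public class SingleDocVectorizer {

    /** Lower-case and split on non-letters. Must match the tokenizer used at training time. */
    static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^a-z]+");
    }

    /**
     * Build a sparse TF-IDF vector (feature index -> weight) for one document.
     * dictionary: term -> feature index, as built during training.
     * docFreq: feature index -> number of training documents containing the term.
     */
    static Map<Integer, Double> tfidf(String text, Map<String, Integer> dictionary,
                                      Map<Integer, Integer> docFreq, int numDocs) {
        Map<Integer, Integer> tf = new HashMap<>();
        for (String token : tokenize(text)) {
            Integer index = dictionary.get(token);
            if (index != null) {                 // terms unseen in training are dropped
                tf.merge(index, 1, Integer::sum);
            }
        }
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);
            // One common TF-IDF variant (assumption, not necessarily Mahout's formula).
            double idf = Math.log((double) numDocs / df) + 1.0;
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    /** Score each label as the dot product of the vector with its weights; return the best label. */
    static String classify(Map<Integer, Double> vector, Map<String, double[]> labelWeights) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> label : labelWeights.entrySet()) {
            double score = 0.0;
            for (Map.Entry<Integer, Double> f : vector.entrySet()) {
                score += f.getValue() * label.getValue()[f.getKey()];
            }
            if (score > bestScore) {
                bestScore = score;
                best = label.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical training artifacts: dictionary, document frequencies, label weights.
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("cocoa", 0);
        dictionary.put("beans", 1);
        dictionary.put("cairo", 2);

        Map<Integer, Integer> docFreq = new HashMap<>();
        docFreq.put(0, 2);
        docFreq.put(1, 1);
        docFreq.put(2, 1);

        // Toy log-probability-style weights per label, indexed by feature.
        Map<String, double[]> labelWeights = new HashMap<>();
        labelWeights.put("cocoa", new double[] {-1.0, -1.2, -5.0});
        labelWeights.put("egypt", new double[] {-4.0, -4.5, -0.8});

        Map<Integer, Double> vector = tfidf("Cocoa beans, cocoa prices", dictionary, docFreq, 4);
        System.out.println(classify(vector, labelWeights));
    }
}
```

The important point is that the new document has to go through the same tokenization, dictionary, and weighting as the training data; a vector built with a different dictionary will not line up with the model's feature indices.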
~Sarang

-----Original Message-----
From: vybe3142 [mailto:[email protected]]
Sent: Friday, February 01, 2013 1:29 PM
To: [email protected]
Subject: How to classifyan individual file after training

1. Index the training data that I've pre-classified manually. Then perform training and testing. Everything works fine to this point.

/home/me/data/reuters-21578-example
├── chrysler (dir with training files)
├── cocoa (dir with training files)
├── egypt (dir with training files)
└── england (dir with training files)

mahout seqdirectory -i /home/me/data/reuters-21578-example -o reuters-out-seqdir -c UTF-8 -chunk 5
mahout seq2sparse -i reuters-out-seqdir/ -o reuters-out-seqdir-sparse -lnorm -nv -wt tfidf
mahout split -i reuters-out-seqdir-sparse/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
mahout trainnb -i train-vectors -el -o model -li labelindex -ow
mahout testnb -i test-vectors -m model -l labelindex -ow -o testing

This seems to work (looking at the confusion matrix), even though these are plain old text snippets as opposed to newsgroup text articles.

2. At this point, I want to classify individual files that are not part of the training set. I've tried a bunch of things that don't seem to work. For example, I try to invoke main() on TestNewsGroups.java with the args

--input /home/me/data/reuters-21578 --model /home/me/test/mahout/quickstart-classifier/model/naiveBayesModel.bin

and end up with an exception:

Exception in thread "main" java.io.UTFDataFormatException: malformed input around byte 5
    at java.io.DataInputStream.readUTF(DataInputStream.java:617)
    at java.io.DataInputStream.readUTF(DataInputStream.java:547)
    at org.apache.mahout.classifier.sgd.PolymorphicWritable.read(PolymorphicWritable.java:41)
    at org.apache.mahout.classifier.sgd.ModelSerializer.readBinary(ModelSerializer.java:69)
    at com.memonews.mahout.sentiment.TestNewsGroups.run(TestNewsGroups.java:67)
    at com.memonews.mahout.sentiment.TestNewsGroups.main(TestNewsGroups.java:59)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

Any idea what I can do to fix this?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/How-to-classifyan-individual-file-after-training-tp4038036.html
Sent from the Mahout User List mailing list archive at Nabble.com.
