You will need to write custom code that converts the text of the file into
a vector and then passes that vector to the pre-built model for classification.
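
A minimal sketch of the first half, in case it helps: it turns raw text into
a sparse tf-idf vector in plain Java. Everything named here (TfIdfSketch,
tokenize, vectorize, the toy dictionary and counts) is hypothetical and is
not Mahout API. In practice the dictionary and document frequencies would
have to come from the seq2sparse output, and the tokenizer and weighting
would have to match what was used at training time. The weighting below is
the textbook tf * ln(N/df), which may differ from what Mahout computes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class TfIdfSketch {

    // Lowercase and split on non-alphanumerics. This is only a stand-in
    // (assumption) for whatever analyzer produced the training vectors.
    static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    // Build a sparse vector (feature index -> tf-idf weight) for one document.
    // Terms missing from the training dictionary are dropped, because the
    // model has no feature for them.
    static Map<Integer, Double> vectorize(String text,
                                          Map<String, Integer> dictionary,
                                          Map<Integer, Integer> docFreq,
                                          int numDocs) {
        // Count term frequencies per feature index.
        Map<Integer, Integer> tf = new HashMap<>();
        for (String token : tokenize(text)) {
            Integer index = dictionary.get(token);
            if (index != null) {
                tf.merge(index, 1, Integer::sum);
            }
        }
        // Weight each count by the inverse document frequency.
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Integer, Integer> e : tf.entrySet()) {
            int df = docFreq.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) numDocs / df);
            vector.put(e.getKey(), e.getValue() * idf);
        }
        return vector;
    }

    public static void main(String[] args) {
        // Toy values for illustration only (assumption, not real training output).
        Map<String, Integer> dictionary = new HashMap<>();
        dictionary.put("cocoa", 0);
        dictionary.put("prices", 1);
        dictionary.put("egypt", 2);

        Map<Integer, Integer> docFreq = new HashMap<>();
        docFreq.put(0, 2);
        docFreq.put(1, 4);
        docFreq.put(2, 1);

        System.out.println(vectorize("Cocoa prices rose; cocoa exports fell",
                                     dictionary, docFreq, 4));
    }
}
```

The resulting index/weight pairs would then be copied into whatever vector
type the model's classifier expects. That step is left out here because it
depends on the Mahout version in use.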

~Sarang

-----Original Message-----
From: vybe3142 [mailto:[email protected]] 
Sent: Friday, February 01, 2013 1:29 PM
To: [email protected]
Subject: How to classify an individual file after training

1. Index the training data that I've pre-classified manually, then perform
training and testing. Everything works fine up to this point:
/home/me/data/reuters-21578-example
├── chrysler (dir with training files)
├── cocoa (dir with training files)
├── egypt (dir with training files)
└── england (dir with training files)


mahout seqdirectory -i /home/me/data/reuters-21578-example \
    -o reuters-out-seqdir -c UTF-8 -chunk 5
mahout seq2sparse -i reuters-out-seqdir/ -o reuters-out-seqdir-sparse \
    -lnorm -nv -wt tfidf
mahout split -i reuters-out-seqdir-sparse/tfidf-vectors \
    --trainingOutput train-vectors --testOutput test-vectors \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

mahout trainnb -i train-vectors -el -o model -li labelindex -ow 
mahout testnb -i test-vectors -m model -l labelindex -ow -o testing
This seems to work (judging by the confusion matrix), even though these are
plain old text snippets as opposed to newsgroup text articles.

2. At this point, I want to classify individual files that are not part of
the training set. I've tried a bunch of things that don't seem to work.
For example, I tried invoking main() on TestNewsGroups.java with the args

--input /home/me/data/reuters-21578 --model
/home/me/test/mahout/quickstart-classifier/model/naiveBayesModel.bin

and ended up with an exception:
Exception in thread "main" java.io.UTFDataFormatException: malformed input around byte 5
        at java.io.DataInputStream.readUTF(DataInputStream.java:617)
        at java.io.DataInputStream.readUTF(DataInputStream.java:547)
        at org.apache.mahout.classifier.sgd.PolymorphicWritable.read(PolymorphicWritable.java:41)
        at org.apache.mahout.classifier.sgd.ModelSerializer.readBinary(ModelSerializer.java:69)
        at com.memonews.mahout.sentiment.TestNewsGroups.run(TestNewsGroups.java:67)
        at com.memonews.mahout.sentiment.TestNewsGroups.main(TestNewsGroups.java:59)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at com.intellij.rt.execution.application.AppMain.main(AppMain.java:120)

Any idea what I can do to fix this? Thanks



--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-classifyan-individual-file-after-training-tp4038036.html
Sent from the Mahout User List mailing list archive at Nabble.com.
