There are a few things you should know.

   1. seq2sparse builds a single dictionary from the combined train and
   test files. To vectorize a new text file you have to reuse that
   dictionary. Otherwise, if you run seq2sparse on the new dataset alone,
   the term ids in the resulting vectors will differ from the ones the
   model was trained on. So I would recommend staying away from it.
   2. First, re-run this experiment using seq2encoded. This program uses a
   hash function (murmur2) to encode the text into vectors, so running it
   on a new dataset produces vectors consistent with the training ones.
   3. Once that's done, run seq2encoded on a directory of unseen text
   documents (one that includes your message.txt among others), and run
   testnb on the output.
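
To illustrate point 1 versus point 2, here is a minimal Python sketch (not Mahout code) of why dictionary-based ids shift between corpora while hashed indices do not. The function names, the toy documents, the cardinality of 1000, and the use of md5 as a stand-in for murmur are all assumptions made for illustration only.

```python
import hashlib


def build_dictionary(docs):
    """Assign term ids in first-seen order, the way a corpus-built dictionary would."""
    ids = {}
    for doc in docs:
        for term in doc.split():
            ids.setdefault(term, len(ids))
    return ids


def hashed_index(term, cardinality=1000):
    """Map a term to a fixed slot with a stable hash (md5 here, standing in for murmur)."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % cardinality


# Hypothetical toy corpora: one used for training, one seen later.
train_docs = ["cheap pills now", "meeting at noon"]
new_docs = ["noon meeting cheap"]

train_ids = build_dictionary(train_docs)
new_ids = build_dictionary(new_docs)

# Dictionary ids depend on which corpus built the dictionary.
print("dictionary id of 'cheap':", train_ids["cheap"], "vs", new_ids["cheap"])

# Hashed indices depend only on the term itself, so they match across runs.
print("hashed index of 'cheap':", hashed_index("cheap"), "vs", hashed_index("cheap"))
```

The trade-off with hashing is that two different terms can collide in the same slot, so the vector size has to be large enough for the vocabulary.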


Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Mon, Apr 15, 2013 at 7:09 PM, Brian Feeny <[email protected]> wrote:

> I am using Mahout version 0.7
>
> I have used the complementary naive bayes classifier to classify basic
> spam/ham messages like so:
>
> Copy easy_ham and spam directories into 20news-all:
>  cp -R easy_ham/ spam/ 20news-all/
>
> Copy 20news-all to HDFS:
> hadoop fs -put 20news-all
>
> Prepare data by sequencing into vectors:
>  mahout seqdirectory -i 20news-all -o 20news-seq
>  mahout seq2sparse -i 20news-seq -o 20news-vectors  -lnorm -nv  -wt tfidf
>
> Split data into train and test sets with 20% of the data being used for
> test and 80% for train:
> mahout split -i 20news-vectors/tfidf-vectors --trainingOutput
> 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct
> 20 --overwrite --sequenceFiles -xm sequential
>
> Build the model:
> mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
>
> You can test the model against the training set:
> mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o
> 20news-testing-train -c
>
> Now test against the test set:
> mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o
> 20news-testing-test -c
>
>
> This all works fine, I get good results with my Confusion Matrix output.
>
> Now what if I have a message called message.txt.  How would I pass this to
> my data model to see if it classifies it as spam or ham?
>
>
>
