I am using Mahout 0.7. I used the Complementary Naive Bayes classifier to classify basic spam/ham messages like so:
Copy the easy_ham and spam directories into 20news-all:

    cp -R easy_ham/ spam/ 20news-all/

Copy 20news-all to HDFS:

    hadoop fs -put 20news-all

Prepare the data by converting it to sequence files and then to TF-IDF vectors:

    mahout seqdirectory -i 20news-all -o 20news-seq
    mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

Split the data into train and test sets, with 20% used for testing and 80% for training:

    mahout split -i 20news-vectors/tfidf-vectors --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

Build the model:

    mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c

Test the model against the training set:

    mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c

Now test against the test set:

    mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c

This all works fine, and the confusion matrix output shows good results.

Now suppose I have a single new message in a file called message.txt. How would I pass it to my model to see whether it is classified as spam or ham?
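For reference, the training and testing steps above can be collected into one script. This is only a sketch: the commands and flags are exactly the ones listed above, while `DRY_RUN`, `run`, and `pipeline` are names I made up for illustration and are not part of Mahout or Hadoop. By default it only prints the commands, so it can be checked without a cluster.

```shell
#!/bin/sh
# Sketch: the spam/ham pipeline described above, collected into one script.
# DRY_RUN, run() and pipeline() are illustrative helpers, not Mahout features.
set -e

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print each command, 0 = actually execute it

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"             # show the command instead of running it
  else
    "$@"                  # run the command; set -e stops on the first failure
  fi
}

pipeline() {
  # Gather both classes under one parent directory and upload it to HDFS.
  run cp -R easy_ham/ spam/ 20news-all/
  run hadoop fs -put 20news-all

  # Convert raw text to sequence files, then to TF-IDF vectors.
  run mahout seqdirectory -i 20news-all -o 20news-seq
  run mahout seq2sparse -i 20news-seq -o 20news-vectors -lnorm -nv -wt tfidf

  # Hold out 20% of the vectors for testing.
  run mahout split -i 20news-vectors/tfidf-vectors \
      --trainingOutput 20news-train-vectors --testOutput 20news-test-vectors \
      --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential

  # Train the complementary model (-c), then evaluate on both sets.
  run mahout trainnb -i 20news-train-vectors -el -o model -li labelindex -ow -c
  run mahout testnb -i 20news-train-vectors -m model -l labelindex -ow -o 20news-testing-train -c
  run mahout testnb -i 20news-test-vectors -m model -l labelindex -ow -o 20news-testing-test -c
}

pipeline
```

Running it with `DRY_RUN=0` executes the commands for real. It covers only the steps I already have working; it does not classify message.txt, which is the part I am asking about.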
