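Your first confusion matrix looks too good to be true, which suggests a target leak or some other problem with the model. Note that the testnb command you posted for that run reads ${WORK_DIR}/data-train-vectors, the same vectors the model was trained on, so a perfect score there says little about how the model handles unseen data.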
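On the second, "vertically aligned" matrix: one possible explanation, which nothing in this thread confirms, is that running seq2sparse separately on data2 builds a fresh term dictionary, so the term ids in the new vectors no longer line up with the ids the model was trained on. The toy Python sketch below only illustrates that effect. The helper names (build_dictionary, vectorize) and the sample documents are made up for the example and are not Mahout code.

```python
def build_dictionary(docs):
    """Assign each distinct term an integer id in order of first appearance."""
    dictionary = {}
    for doc in docs:
        for term in doc.split():
            dictionary.setdefault(term, len(dictionary))
    return dictionary


def vectorize(doc, dictionary):
    """Return a sparse count vector {term_id: count}; unknown terms are dropped."""
    vector = {}
    for term in doc.split():
        if term in dictionary:
            idx = dictionary[term]
            vector[idx] = vector.get(idx, 0) + 1
    return vector


# Hypothetical corpora: the "training" set and a separately vectorized subset.
train_docs = ["good great film", "bad awful film", "maybe unsure film"]
test_docs = ["awful bad", "great good"]

train_dict = build_dictionary(train_docs)  # dictionary the model was trained with
test_dict = build_dictionary(test_docs)    # dictionary rebuilt from the test data only

# Reverse lookup: what each id means to a model trained with train_dict.
train_terms = {idx: term for term, idx in train_dict.items()}

doc = "awful bad"
consistent = vectorize(doc, train_dict)  # ids the model understands
mismatched = vectorize(doc, test_dict)   # ids from the rebuilt dictionary

print("with training dictionary:", consistent)
print("with rebuilt dictionary: ", mismatched)
print("model reads rebuilt ids as:", [train_terms[i] for i in sorted(mismatched)])
```

In this toy case the same negative document gets ids 3 and 4 under the training dictionary but ids 0 and 1 under the rebuilt one, and a model trained with the first dictionary would read ids 0 and 1 as "good" and "great". If something similar is happening here, it would be consistent with the split-based test set behaving correctly, since those vectors come from the same seq2sparse run as the training vectors.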
I wanted to suggest a ModelDissector for analyzing the Naive Bayes model, but I just found out that the ModelDissector in Mahout does not work for Naive Bayes; it only supports SGD. You might want to read this thread:
http://lucene.472066.n3.nabble.com/Training-vectors-for-classification-td2080345.html

As Ted suggests in that discussion, creating a ModelDissector for Naive Bayes would be a good way to contribute to Mahout, and you could also use it to solve your problem.

On Wed, Oct 17, 2012 at 1:18 PM, Priyadarshan Raj <[email protected]> wrote:

> Hi paritosh,
>
> As suggested by you I ran seq2sparse with these arguments:
>
> bin/mahout seq2sparse -i ${user-dir}/fact-seq -o ${user-dir}/fact-vectors -lnorm -nv -wt tfidf --maxDFSigma 3.0 --maxDFPercent 100 --minSupport 5
>
> but I am still getting the same result.
>
> You also suggested using -analyzerName, but I think Mahout itself uses "DefaultAnalyzer", which by default assigns StandardAnalyzer(Version.LUCENE_36) as its analyzer. I think there is a problem in the seq2sparse command. When I create a training and a testing set from the same set of vectors using the "split" command, running "testnb" on the test set gives a correct confusion matrix. But when I separately create vectors from a subset of the training data, I get that "vertically aligned", entirely wrong confusion matrix.
>
> Thanks
>
> On Tue, Oct 16, 2012 at 6:39 PM, paritosh ranjan <[email protected]> wrote:
>
> > I am not an expert on Mahout's Naive Bayes, but since everyone seems to be busy, I would like to point you towards certain things that you have not tried yet and might want to try.
> >
> > Try out
> >
> > --maxDFPercent, --minSupport
> > Both of these options drop terms that are either too frequent (max) or not frequent enough across the collection of documents. Useful in automatically dropping common or very infrequent terms that add little value to the calculation
> >
> > in the seq2sparse command.
> >
> > Also try
> >
> > --analyzerName
> > An Apache Lucene analyzer class that can be used to tokenize, stem, remove, or otherwise change the words in the document
> >
> > to get rid of common words and also to stem words so that similar words are converted into the same form.
> >
> > Creating a model takes a lot of trial and error, in my opinion. I suggest exploring the different parameters provided by each mahout command. I am sure you will be able to move ahead.
> >
> > Good luck.
> >
> > On Tue, Oct 16, 2012 at 12:47 PM, rdarshan <[email protected]> wrote:
> >
> > > Hi,
> > > I am working on sentiment analysis of tweets.
> > > I am using the Mahout Naive Bayes classifier for it. I made a directory "data", and inside "data" I made three more directories named "positive", "negative" and "uncertain". Then I put 151 files (total 151 MB) in each of the positive, negative and uncertain directories. Then I put the data directory in HDFS. Below is the set of commands I ran to generate the model and labelindex from it.
> > >
> > > bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq
> > > bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors -lnorm -nv -wt tfidf
> > > bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/data-train-vectors --testOutput ${WORK_DIR}/data-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
> > > bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow $c
> > >
> > > After testing on the same set of data with the "testnb" command given below, I get the following confusion matrix:
> > >
> > > bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c
> > >
> > > Confusion Matrix
> > > -------------------------------------------------------
> > > a     b     c     <--Classified as
> > > 151   0     0     |  151   a = negative
> > > 0     151   0     |  151   b = positive
> > > 0     0     151   |  151   c = uncertain
> > >
> > >
> > > Then I created another directory "data2" in the same way and put some random data, which is a subset of the training data (30 files, total size 30 MB, each), in the positive, negative and uncertain directories inside it. Then I created vectors from it using the "seq2sparse" command given below:
> > >
> > > bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq
> > > bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o ${WORK_DIR}/data2-vectors -lnorm -nv -wt tfidf
> > >
> > > I then ran "testnb" with the model and labelindex created from the previous set of data, using the command given below:
> > >
> > > bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000 -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data2-testing $c
> > >
> > > I am getting a confusion matrix like this:
> > >
> > > Confusion Matrix
> > > -------------------------------------------------------
> > > a     b     c     <--Classified as
> > > 0     30    0     |  30    a = negative
> > > 0     30    0     |  30    b = positive
> > > 0     30    0     |  30    c = uncertain
> > >
> > > Can anyone tell me why this is happening? Am I testing the model the correct way, or is it a bug in Mahout 0.7? If this is not the correct way, please suggest a way out.
> > >
> > > --
> > > View this message in context:
> > > http://lucene.472066.n3.nabble.com/Using-model-of-mahout-0-7-tp4013891.html
> > > Sent from the Mahout User List mailing list archive at Nabble.com.
