Hi Paritosh,
As you suggested, I ran seq2sparse with these arguments:
bin/mahout seq2sparse -i ${user-dir}/fact-seq -o ${user-dir}/fact-vectors
-lnorm -nv -wt tfidf --maxDFSigma 3.0 --maxDFPercent 100 --minSupport 5
but I still get the same result.
You also suggested using --analyzerName, but I think Mahout itself uses
"DefaultAnalyzer", which by default assigns
StandardAnalyzer(Version.LUCENE_36) as its analyzer. I think there is a
problem in the seq2sparse command. When I create the training and test
sets from the same set of vectors using the "split" command, running
"testnb" on the test set gives a correct confusion matrix. But when I
create vectors separately from a subset of the training data, I get that
entirely wrong, "vertically aligned" confusion matrix (every document
falls into a single column).
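
One possible explanation, offered here as an assumption and not confirmed
anywhere in this thread: each seq2sparse run builds its own dictionary, so
the same term index may stand for different words in the two vector sets.
The toy Python sketch below only illustrates that idea. The helper names
(build_dictionary, vectorize) and the first-seen index order are
hypothetical and are not Mahout's API or its actual index assignment.

```python
def build_dictionary(docs):
    """Assign each distinct term an integer index in first-seen order (toy assumption)."""
    index = {}
    for doc in docs:
        for term in doc.split():
            index.setdefault(term, len(index))
    return index


def vectorize(doc, dictionary):
    """Return a sparse term-count vector {term index: count}; unknown terms are skipped."""
    vec = {}
    for term in doc.split():
        if term in dictionary:
            i = dictionary[term]
            vec[i] = vec.get(i, 0) + 1
    return vec


# Hypothetical miniature corpora standing in for "data" and "data2".
train_docs = ["good great happy", "bad awful sad", "maybe unsure"]
subset_docs = ["bad awful sad"]

train_dict = build_dictionary(train_docs)    # dictionary from the full run
subset_dict = build_dictionary(subset_docs)  # dictionary from the separate run

# The same document, vectorized against each dictionary.
same = vectorize("bad awful sad", train_dict)
other = vectorize("bad awful sad", subset_dict)

# What the second vector's indices mean under the training dictionary.
train_terms = {i: t for t, i in train_dict.items()}
misread = sorted(train_terms[i] for i in other)

print(same)     # {3: 1, 4: 1, 5: 1}
print(other)    # {0: 1, 1: 1, 2: 1}
print(misread)  # ['good', 'great', 'happy']
```

In this toy case a negative document, read through the training
dictionary, looks like the positive one. That would be consistent with all
documents landing in one column, but it is only a sketch of the idea.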
Thanks
On Tue, Oct 16, 2012 at 6:39 PM, paritosh ranjan
<[email protected]> wrote:
> I am not an expert on Mahout's Naive Bayes, but since everyone seems to be
> busy, I would like to point you towards certain things that you have not
> tried yet and might want to try.
>
> Try these options in the seq2sparse command:
>
> --maxDFPercent, --minSupport: Both of these options drop terms that are
> either too frequent (max) or not frequent enough across the collection of
> documents. They are useful for automatically dropping common or very
> infrequent terms that add little value to the calculation.
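
To make the effect of these two cutoffs concrete, here is a minimal Python
sketch of frequency-based term pruning. It is not Mahout code: the function
prune_terms and its parameters are hypothetical, and it assumes the lower
cutoff counts total occurrences of a term while the upper cutoff is the
percentage of documents that contain it.

```python
from collections import Counter


def prune_terms(docs, min_support, max_df_percent):
    """Keep terms seen at least min_support times overall and in at most
    max_df_percent percent of the documents (assumed semantics)."""
    total = Counter()  # total occurrences of each term
    df = Counter()     # number of documents containing each term
    for doc in docs:
        terms = doc.split()
        total.update(terms)
        df.update(set(terms))
    n = len(docs)
    return sorted(
        t for t in total
        if total[t] >= min_support and 100.0 * df[t] / n <= max_df_percent
    )


# Hypothetical miniature corpus.
docs = [
    "the movie was good",
    "the movie was bad",
    "the plot was good good",
    "the ending",
]

kept = prune_terms(docs, min_support=2, max_df_percent=60)
kept_all = prune_terms(docs, min_support=2, max_df_percent=100)

print(kept)      # ['good', 'movie']
print(kept_all)  # ['good', 'movie', 'the', 'was']
```

In this sketch an upper cutoff of 100 percent removes nothing, so only the
lower cutoff has any effect in the second call.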
>
> Also try
>
> --analyzerName: An Apache Lucene analyzer class that can be used to
> tokenize, stem, remove, or otherwise change the words in the document
>
> to get rid of common words and also to stem words, so that similar words
> are converted into the same form.
>
> Creating a model takes a lot of trial and error, in my opinion. I suggest
> exploring the different parameters provided by each Mahout command. I am
> sure you will be able to move ahead.
>
> Good luck.
>
> On Tue, Oct 16, 2012 at 12:47 PM, rdarshan <[email protected]> wrote:
>
> > Hi,
> > I am working on sentiment analysis of tweets, using the Mahout naive
> > Bayes classifier. I created a directory "data" containing three more
> > directories named "positive", "negative", and "uncertain". Then I put
> > 151 files (151 MB total) in each of the positive, negative, and
> > uncertain directories and copied the data directory to HDFS. Below are
> > the commands I ran to generate the model and labelindex from it.
> >
> > bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq
> > bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o
> > ${WORK_DIR}/data-vectors
> > -lnorm -nv -wt tfidf
> > bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors
> > --trainingOutput ${WORK_DIR}/data-train-vectors --testOutput
> > ${WORK_DIR}/data-test-vectors --randomSelectionPct 40 --overwrite
> > --sequenceFiles -xm sequential
> > bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o
> > ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow $c
> >
> > Testing on the same set of data with the "testnb" command given below,
> > I get the following confusion matrix:
> >
> > bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model
> > -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c
> >
> > Confusion Matrix
> > -------------------------------------------------------
> > a     b     c     <--Classified as
> > 151   0     0      |  151   a = negative
> > 0     151   0      |  151   b = positive
> > 0     0     151    |  151   c = uncertain
> >
> >
> > Then I created another directory, "data2", in the same way and put some
> > random data in its positive, negative, and uncertain directories. This
> > data is a subset of the training data (30 files, 30 MB total, in each
> > directory). Then I created vectors from it using the "seq2sparse"
> > command given below:
> >
> > bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq
> > bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o
> > ${WORK_DIR}/data2-vectors -lnorm -nv -wt tfidf
> >
> > I then ran "testnb" with the model and labelindex created from the
> > previous set of data, using the command given below:
> >
> > bin/mahout testnb -i
> > ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000
> > -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o
> > ${WORK_DIR}/data2-testing $c
> >
> > I get a confusion matrix like this:
> >
> > Confusion Matrix
> > -------------------------------------------------------
> > a     b     c     <--Classified as
> > 0     30    0      |  30    a = negative
> > 0     30    0      |  30    b = positive
> > 0     30    0      |  30    c = uncertain
> >
> > Can anyone tell me why this is happening? Am I testing the model the
> > correct way, or is this a bug in Mahout 0.7? If this is not the correct
> > way, please suggest an alternative.
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/Using-model-of-mahout-0-7-tp4013891.html
> > Sent from the Mahout User List mailing list archive at Nabble.com.
> >
>