I am not an expert on Mahout's Naive Bayes, but since everyone seems to be
busy, I would like to point you towards a few things that you have not
tried yet and might want to try.

Try out the following options in the seq2sparse command:

--maxDFPercent, --minSupport Both of these options drop terms that are
either too frequent (max) or not frequent enough (min) across the collection
of documents. They are useful for automatically dropping common or very
infrequent terms that add little value to the calculation.
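
As a rough sketch, your seq2sparse step could be rerun with both pruning
options added. The thresholds (85 and 5) and the output directory name
data-vectors-pruned are placeholders I made up for illustration, not tuned
recommendations, and I have not run this against your data. The script only
prints the command unless RUN=1 is set, so you can inspect it first.

```shell
#!/bin/sh
# Sketch only: seq2sparse with term pruning added to the original options.
# WORK_DIR default, thresholds, and output directory name are assumptions.
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"
MAX_DF_PERCENT="${MAX_DF_PERCENT:-85}"  # prune terms found in more than this % of documents
MIN_SUPPORT="${MIN_SUPPORT:-5}"         # prune terms seen fewer than this many times

# Build the argument list as positional parameters so paths stay intact.
set -- bin/mahout seq2sparse \
  -i "${WORK_DIR}/data-seq" \
  -o "${WORK_DIR}/data-vectors-pruned" \
  -lnorm -nv -wt tfidf \
  --maxDFPercent "${MAX_DF_PERCENT}" \
  --minSupport "${MIN_SUPPORT}"

# Keep a printable copy of the command line and show it.
CMD_LINE="$*"
echo "${CMD_LINE}"

# Run the command only when explicitly requested.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

For example, RUN=1 MAX_DF_PERCENT=70 sh prune.sh would run it with a
stricter document-frequency cutoff (prune.sh being whatever name you save
the script under). Comparing the confusion matrix for a few different values
is probably the quickest way to see whether pruning helps.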

Also try

--analyzerName An Apache Lucene analyzer class that can be used to
tokenize, stem, remove, or otherwise change the words in the document

to get rid of common words and also to stem words, so that similar words are
converted into the same form.
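
A sketch of passing an analyzer to seq2sparse might look like the script
below. The class name org.apache.lucene.analysis.en.EnglishAnalyzer is my
assumption of a Lucene analyzer that removes stop words and stems; whether
it is on the classpath and can be instantiated by your Mahout version is
something you would need to check, and you may need your own analyzer class
instead. The output directory name data-vectors-analyzed is also made up.
As before, nothing runs unless RUN=1 is set.

```shell
#!/bin/sh
# Sketch only: seq2sparse with an explicit Lucene analyzer class.
# ANALYZER_CLASS is an assumption; verify it works with your Mahout/Lucene
# versions or substitute your own analyzer class.
WORK_DIR="${WORK_DIR:-/tmp/mahout-work}"
ANALYZER_CLASS="${ANALYZER_CLASS:-org.apache.lucene.analysis.en.EnglishAnalyzer}"

# Build the argument list as positional parameters so paths stay intact.
set -- bin/mahout seq2sparse \
  -i "${WORK_DIR}/data-seq" \
  -o "${WORK_DIR}/data-vectors-analyzed" \
  -lnorm -nv -wt tfidf \
  --analyzerName "${ANALYZER_CLASS}"

# Keep a printable copy of the command line and show it.
ANALYZER_CMD_LINE="$*"
echo "${ANALYZER_CMD_LINE}"

# Run the command only when explicitly requested.
if [ "${RUN:-0}" = "1" ]; then
  "$@"
fi
```

Setting ANALYZER_CLASS to a different class name lets you try other
analyzers without editing the script.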

Creating a model takes a lot of trial and error, in my opinion. I suggest
exploring the different parameters provided by each mahout command. I am
sure you will be able to move ahead.

Good luck.

On Tue, Oct 16, 2012 at 12:47 PM, rdarshan <[email protected]> wrote:

> Hi,
> I am working on sentiment analysis of tweets, using the Mahout Naive Bayes
> classifier. I made a directory "data" and, inside it, three more
> directories named "positive", "negative", and "uncertain". I put 151 files
> (total 151 MB) in each of the positive, negative, and uncertain
> directories, then copied the data directory to HDFS. Below are the commands
> I ran to generate the model and label index from it.
>
> bin/mahout seqdirectory -i ${WORK_DIR}/data -o ${WORK_DIR}/data-seq
> bin/mahout seq2sparse -i ${WORK_DIR}/data-seq -o ${WORK_DIR}/data-vectors \
>   -lnorm -nv -wt tfidf
> bin/mahout split -i ${WORK_DIR}/data-vectors/tfidf-vectors \
>   --trainingOutput ${WORK_DIR}/data-train-vectors \
>   --testOutput ${WORK_DIR}/data-test-vectors \
>   --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
> bin/mahout trainnb -i ${WORK_DIR}/data-train-vectors -el -o ${WORK_DIR}/model \
>   -li ${WORK_DIR}/labelindex -ow $c
>
> I get the following confusion matrix when I test on the same set of data
> with the "testnb" command:
>
> bin/mahout testnb -i ${WORK_DIR}/data-train-vectors -m ${WORK_DIR}/model \
>   -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/data-testing $c
>
> Confusion Matrix
> -------------------------------------------------------
> a       b       c       <--Classified as
> 151     0       0       |  151      a     = negative
> 0       151     0       |  151      b     = positive
> 0       0       151     |  151      c     = uncertain
>
>
> Then I created another directory, "data2", in the same way and put some
> random data (a subset of the training data: 30 files each, total size
> 30 MB) in the positive, negative, and uncertain directories inside it. Then
> I created vectors from it using the "seq2sparse" command given below:
>
> bin/mahout seqdirectory -i ${WORK_DIR}/data2 -o ${WORK_DIR}/data2-seq
> bin/mahout seq2sparse -i ${WORK_DIR}/data2-seq -o ${WORK_DIR}/data2-vectors \
>   -lnorm -nv -wt tfidf
>
> When I run "testnb" with the model and label index created from the
> previous set of data, using the command below:
>
> bin/mahout testnb -i ${WORK_DIR}/data2-vectors/tfidf-vectors/part-r-00000 \
>   -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow \
>   -o ${WORK_DIR}/data2-testing $c
>
> I get a confusion matrix like this:
>
> Confusion Matrix
> -------------------------------------------------------
> a       b       c       <--Classified as
> 0       30      0       |  30       a     = negative
> 0       30      0       |  30       b     = positive
> 0       30      0       |  30       c     = uncertain
>
> Can anyone tell me why this is happening? Am I testing the model the
> correct way, or is this a bug in Mahout 0.7? If this is not the correct
> way, please suggest a way out of it.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Using-model-of-mahout-0-7-tp4013891.html
> Sent from the Mahout User List mailing list archive at Nabble.com.
>
