Re: Error and doubts in using Mllib Naive bayes for text clasification

Rahul Bhojwani Tue, 08 Jul 2014 13:34:28 -0700

Thanks Xiangrui. You have solved almost all my problems :)


On Wed, Jul 9, 2014 at 1:47 AM, Xiangrui Meng <men...@gmail.com> wrote:

> 1) The feature dimension must be fixed before you run NaiveBayes. If
> you use bag of words, you need to maintain the word-to-index
> dictionary yourself. You can either ignore words that never appear in
> training (because they have no effect on prediction), or use hashing
> to randomly project words into a fixed-size feature space (collisions
> may happen).
>
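Both options in point 1 can be sketched in plain Python. This is only an illustration, not MLlib code: the helper names (build_vocab, counts_from_vocab, hashed_counts) and the default of 1000 buckets are made up for the example, and zlib.crc32 is just one possible stable hash.

```python
import zlib
import numpy as np

def build_vocab(training_docs):
    """Map each word seen in training to a fixed column index."""
    vocab = {}
    for tokens in training_docs:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab)
    return vocab

def counts_from_vocab(tokens, vocab):
    """Count words using the training vocabulary; unseen words are skipped."""
    vec = np.zeros(len(vocab))
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1.0
    return vec

def hashed_counts(tokens, num_features=1000):
    """Count words in a fixed number of buckets; distinct words may collide."""
    vec = np.zeros(num_features)
    for tok in tokens:
        # crc32 is stable across runs, unlike Python 3's built-in hash() on str
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return vec
```

Either way the vector length is decided before training, so a test document with new words still produces a vector of the same size as the training vectors.
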
> 3) Yes, we saved the log conditional probabilities. So to compute the
> likelihood, we only need summation.
>
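To make point 3 concrete, here is a small NumPy sketch of scoring with stored log probabilities. The numbers are invented for illustration (two classes, three word features) and are not taken from any trained model; the variable names are also assumptions.

```python
import numpy as np

# Made-up model: log priors and log conditional probabilities per class.
log_prior = np.log(np.array([0.5, 0.5]))
log_theta = np.log(np.array([[0.7, 0.2, 0.1],    # class 0
                             [0.1, 0.3, 0.6]]))  # class 1

# Word counts for one document.
x = np.array([2.0, 0.0, 1.0])

# A product of probabilities becomes a sum of logs, weighted by the counts.
scores = log_prior + log_theta.dot(x)
predicted = int(np.argmax(scores))
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow for long documents.
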
> Best,
> Xiangrui
>
> On Tue, Jul 8, 2014 at 12:01 AM, Rahul Bhojwani
> <rahulbhojwani2...@gmail.com> wrote:
> > I am really sorry. It's actually my mistake. My problem 2 is wrong because
> > using a single feature is senseless. Sorry for the inconvenience.
> > I will still be waiting for solutions to problems 1 and 3.
> >
> > Thanks,
> >
> >
> > On Tue, Jul 8, 2014 at 12:14 PM, Rahul Bhojwani
> > <rahulbhojwani2...@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I am a novice. I want to classify text into two classes. For this
> >> purpose I want to use a Naive Bayes model. I am using Python for it.
> >>
> >> Here are the problems I am facing:
> >>
> >> Problem 1: I wanted to use all words as features for the bag-of-words
> >> model, which means my features will be the counts of individual words.
> >> In this case, whenever a new word appears in the test data (one that was
> >> never present in the training data), I need to increase the size of the
> >> feature vector to incorporate that word as well. Correct me if I am
> >> wrong. Can I do that in the present MLlib NaiveBayes? If not, how can I
> >> handle this?
> >>
> >> Problem 2: As I was not able to proceed with all words, I did some
> >> pre-processing and extracted a few features from the text. But using
> >> these is also giving errors.
> >> Right now I am testing with only one feature from the text, the count
> >> of positive words. I am submitting the code below, along with the error:
> >>
> >>
> >> #############Code
> >>
> >> import tokenizer
> >> import gettingWordLists as gl
> >> from pyspark.mllib.classification import NaiveBayes
> >> from numpy import array
> >> from pyspark import SparkContext, SparkConf
> >>
> >> conf = (SparkConf().setMaster("local[6]").setAppName("My app")
> >>         .set("spark.executor.memory", "1g"))
> >>
> >> sc=SparkContext(conf = conf)
> >>
> >> # Getting the positive dict:
> >> pos_list = []
> >> pos_list = gl.getPositiveList()
> >> tok = tokenizer.Tokenizer(preserve_case=False)
> >>
> >>
> >> train_data  = []
> >>
> >> with open("training_file.csv","r") as train_file:
> >>     for line in train_file:
> >>         tokens = line.split(",")
> >>         msg = tokens[0]
> >>         sentiment = tokens[1]
> >>         count = 0
> >>         tokens = set(tok.tokenize(msg))
> >>         for i in tokens:
> >>             if i.encode('utf-8') in pos_list:
> >>                 count+=1
> >>         if 'NEG' in sentiment:
> >>             label = 0.0
> >>         else:
> >>             label = 1.0
> >>         feature = []
> >>         feature.append(label)
> >>         feature.append(float(count))
> >>         train_data.append(feature)
> >>
> >>
> >> model = NaiveBayes.train(sc.parallelize(array(train_data)))
> >> print model.pi
> >> print model.theta
> >> print "\n\n\n\n\n" , model.predict(array([5.0]))
> >>
> >> ##############
> >> This is the output:
> >>
> >> [-2.24512292 -0.11195389]
> >> [[ 0.]
> >>  [ 0.]]
> >>
> >>
> >>
> >>
> >>
> >> Traceback (most recent call last):
> >>   File "naive_bayes_analyser.py", line 77, in <module>
> >>     print "\n\n\n\n\n" , model.predict(array([5.0]))
> >>   File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
> >>     return numpy.argmax(self.pi + dot(x, self.theta))
> >> ValueError: matrices are not aligned
> >>
> >> ##############
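The ValueError can be reproduced with NumPy alone, using the shapes printed above: model.theta is 2x1 (two classes, one feature) and the input has length 1. The sketch below only illustrates the shape mismatch. Whether multiplying in the other order is the right fix inside the library is an assumption, not something confirmed in this thread.

```python
import numpy as np

pi = np.array([-2.24512292, -0.11195389])  # values printed by model.pi
theta = np.zeros((2, 1))                   # same shape as the printed model.theta
x = np.array([5.0])                        # one feature, as in the predict call

# dot(x, theta) pairs x's length (1) with theta's row count (2), so it fails.
try:
    np.dot(x, theta)
    aligned = True
except ValueError:
    aligned = False

# With theta on the left, each class row is multiplied by the feature vector.
scores = pi + np.dot(theta, x)
```
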
> >>
> >> Problem 3: As you can see, the output for model.pi is negative. That is,
> >> the prior probabilities are negative. Can someone explain that as well?
> >> Is it the log of the probability?
> >>
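A quick way to check the log interpretation is to exponentiate the printed model.pi values and see whether they sum to 1. A minimal NumPy check, using only the two numbers from the output above:

```python
import numpy as np

pi = np.array([-2.24512292, -0.11195389])  # values printed by model.pi

priors = np.exp(pi)   # back from log space to probabilities
total = priors.sum()  # should be close to 1 if pi holds log priors
```

The two values come out at roughly 0.106 and 0.894, which is consistent with pi holding log class priors.
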
> >>
> >>
> >> Thanks,
> >> --
> >> Rahul K Bhojwani
> >> 3rd Year B.Tech
> >> Computer Science and Engineering
> >> National Institute of Technology, Karnataka
> >
> >
> >
> >
> > --
> > Rahul K Bhojwani
> > 3rd Year B.Tech
> > Computer Science and Engineering
> > National Institute of Technology, Karnataka
>



-- 
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka
