Thanks Xiangrui. You have solved almost all my problems :)
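
For anyone reading this thread later, here is a rough sketch of how I understand points 1 and 3 in the reply quoted below. It is plain Python, not the MLlib implementation. The names `hash_features`, `train_naive_bayes`, `predict` and `NUM_FEATURES` are made up for illustration, and the toy training texts are invented. Words are hashed into a fixed number of buckets, so a word never seen in training still lands in an existing feature slot, and prediction only sums log probabilities.

```python
import math
import zlib

NUM_FEATURES = 16  # assumed size of the hashed feature space (hypothetical)


def hash_features(text, num_features=NUM_FEATURES):
    """Map a text to a fixed-length vector of word counts via hashing."""
    vec = [0.0] * num_features
    for word in text.lower().split():
        # crc32 is stable across runs; different words may collide in a bucket
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vec[index] += 1.0
    return vec


def train_naive_bayes(labeled_texts, num_classes=2, smoothing=1.0):
    """Return (log_priors, log_cond) for a multinomial Naive Bayes model.

    Assumes every class appears at least once in labeled_texts.
    """
    class_counts = [0.0] * num_classes
    feature_counts = [[0.0] * NUM_FEATURES for _ in range(num_classes)]
    for label, text in labeled_texts:
        class_counts[label] += 1.0
        for j, count in enumerate(hash_features(text)):
            feature_counts[label][j] += count
    total = sum(class_counts)
    # Log of a probability below 1 is negative, hence negative priors
    log_priors = [math.log(c / total) for c in class_counts]
    log_cond = []
    for row in feature_counts:
        denom = sum(row) + smoothing * NUM_FEATURES
        log_cond.append([math.log((c + smoothing) / denom) for c in row])
    return log_priors, log_cond


def predict(log_priors, log_cond, text):
    """Score each class as log prior plus a count-weighted sum of log probs."""
    x = hash_features(text)
    scores = [
        prior + sum(count * lp for count, lp in zip(x, row))
        for prior, row in zip(log_priors, log_cond)
    ]
    return scores.index(max(scores)), scores


# Invented toy data: label 1 = positive, label 0 = negative
train = [
    (1, "good great good"),
    (1, "great film"),
    (0, "bad awful bad"),
]
log_priors, log_cond = train_naive_bayes(train)
label, scores = predict(log_priors, log_cond, "good unseen_word")
```

The feature vector always has `NUM_FEATURES` entries, so its size does not grow when the test data contains new words. The price is that unrelated words can share a bucket.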

On Wed, Jul 9, 2014 at 1:47 AM, Xiangrui Meng <men...@gmail.com> wrote:

> 1) The feature dimension should be a fixed number before you run
> NaiveBayes. If you use bag of words, you need to handle the
> word-to-index dictionary by yourself. You can either ignore the words
> that never appear in training (because they have no effect in
> prediction), or use hashing to randomly project words to a fixed-sized
> feature space (collisions may happen).
>
> 3) Yes, we saved the log conditional probabilities. So to compute the
> likelihood, we only need summation.
>
> Best,
> Xiangrui
>
> On Tue, Jul 8, 2014 at 12:01 AM, Rahul Bhojwani
> <rahulbhojwani2...@gmail.com> wrote:
> > I am really sorry. It's actually my mistake. My problem 2 is wrong because
> > using a single feature is a senseless thing. Sorry for the inconvenience.
> > But still I will be waiting for the solutions for problem 1 and 3.
> >
> > Thanks,
> >
> >
> > On Tue, Jul 8, 2014 at 12:14 PM, Rahul Bhojwani
> > <rahulbhojwani2...@gmail.com> wrote:
> >>
> >> Hello,
> >>
> >> I am a novice. I want to classify the text into two classes. For this
> >> purpose I want to use the Naive Bayes model. I am using Python for it.
> >>
> >> Here are the problems I am facing:
> >>
> >> Problem 1: I wanted to use all words as features for the bag of words
> >> model, which means my features will be counts of individual words. In this
> >> case, whenever a new word comes in the test data (which was never present
> >> in the train data) I need to increase the size of the feature vector to
> >> incorporate that word as well. Correct me if I am wrong. Can I do that in
> >> the present MLlib NaiveBayes? Or what is the way in which I can
> >> incorporate this?
> >>
> >> Problem 2: As I was not able to proceed with all words, I did some
> >> pre-processing and figured out a few features from the text. But using
> >> this also is giving errors.
> >> Right now I was testing for only one feature from the text that is count
> >> of positive words. I am submitting the code below, along with the error:
> >>
> >>
> >> #############Code
> >>
> >> import tokenizer
> >> import gettingWordLists as gl
> >> from pyspark.mllib.classification import NaiveBayes
> >> from numpy import array
> >> from pyspark import SparkContext, SparkConf
> >>
> >> conf = (SparkConf().setMaster("local[6]").setAppName("My app").set("spark.executor.memory", "1g"))
> >>
> >> sc=SparkContext(conf = conf)
> >>
> >> # Getting the positive dict:
> >> pos_list = []
> >> pos_list = gl.getPositiveList()
> >> tok = tokenizer.Tokenizer(preserve_case=False)
> >>
> >>
> >> train_data = []
> >>
> >> with open("training_file.csv","r") as train_file:
> >>     for line in train_file:
> >>         tokens = line.split(",")
> >>         msg = tokens[0]
> >>         sentiment = tokens[1]
> >>         count = 0
> >>         tokens = set(tok.tokenize(msg))
> >>         for i in tokens:
> >>             if i.encode('utf-8') in pos_list:
> >>                 count+=1
> >>         if sentiment.__contains__('NEG'):
> >>             label = 0.0
> >>         else:
> >>             label = 1.0
> >>         feature = []
> >>         feature.append(label)
> >>         feature.append(float(count))
> >>         train_data.append(feature)
> >>
> >>
> >> model = NaiveBayes.train(sc.parallelize(array(train_data)))
> >> print model.pi
> >> print model.theta
> >> print "\n\n\n\n\n" , model.predict(array([5.0]))
> >>
> >> ##############
> >> This is the output:
> >>
> >> [-2.24512292 -0.11195389]
> >> [[ 0.]
> >>  [ 0.]]
> >>
> >>
> >>
> >>
> >>
> >> Traceback (most recent call last):
> >>   File "naive_bayes_analyser.py", line 77, in <module>
> >>     print "\n\n\n\n\n" , model.predict(array([5.0]))
> >>   File "F:\spark-0.9.1\spark-0.9.1\python\pyspark\mllib\classification.py", line 101, in predict
> >>     return numpy.argmax(self.pi + dot(x, self.theta))
> >> ValueError: matrices are not aligned
> >>
> >> ##############
> >>
> >> Problem 3: As you can see the output for model.pi is -ve.
> >> That is prior probabilities are negative. Can someone explain that also.
> >> Is it the log of the probability?
> >>
> >>
> >>
> >> Thanks,
> >> --
> >> Rahul K Bhojwani
> >> 3rd Year B.Tech
> >> Computer Science and Engineering
> >> National Institute of Technology, Karnataka
> >
> >
> >
> >
> > --
> > Rahul K Bhojwani
> > 3rd Year B.Tech
> > Computer Science and Engineering
> > National Institute of Technology, Karnataka
>

--
Rahul K Bhojwani
3rd Year B.Tech
Computer Science and Engineering
National Institute of Technology, Karnataka