I’ve finally implemented a Naive Bayes classifier for sentiment analysis and it works quite well, but I have a few questions.
The training phase creates a .bin file that is the model of the classifier. I’ve tried to read it but I can’t. What does the .bin file contain? I’m asking because I’d like to understand better how the classifier works. Where can I read something about its implementation?

Thanks in advance, your help was irreplaceable in building my classifier.

On 29 Jul 2014, at 18:40, vaibhav srivastava <[email protected]> wrote:

> Hi Filipponi,
> In this case testnb will not work, because the last part of its code takes the
> label to print the confusion matrix.
>
> If you want to use your model to predict the possible outcomes,
> you have to use the class "TestNaiveBayesDriver.java" to write that,
>
> and comment out this section:
>
> /*if (bestIdx != Integer.MIN_VALUE) {
>     ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
>     analyzer.addInstance(pair.getFirst().toString(), classifierResult);
> }
> */
>
> In that case the output file of BayesTestMapper is going to store the values
> for you. If you can use seqdumper you can get the values for key
> "471685156584292353".
> Or suppose
>
> Key: /471685156584292353/ Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
>
> NaiveBayesModel model = NaiveBayesModel.materialize(output, conf); // output path of Model
> classifier = new ComplementaryNaiveBayesClassifier(model);
> classifier.classifyFull(vector); // this returns a vector of probabilities
> // in 1 of n-1 encoding for your label. Input will be the vector
> // {1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086}
>
> Thanks
> Vaibhav.
>
> On Tue, Jul 29, 2014 at 9:06 PM, Luca Filipponi <[email protected]> wrote:
>
>> I appreciate your help, but for my lack of knowledge I didn't understand.
>>
>> I'll try to explain my problem better :D
>>
>> What I've done is to create a sequence file starting from a CSV like this
>> (these are Italian tweets :D):
>>
>> negativo,471685156584292353, @beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!
>>
>> positivo,471685170698149888,RT @carlucci_cc: @valy_s renzie si preoccupa di chi gli garantisce voti...ma stanno scoprendo il prezzo di quei fottutissimi #80euro dagli ...
>>
>> neutrale,471685174426886144,Di #elezioni, di venditori di fumo e di altre schifezze... http://t.co/euFbtP7hQ1 ... #Europee2014 via
>>
>> So I create a sequence file in this way:
>>
>> String[] tokens = line.split(",", 3);
>>
>> String label = tokens[0];
>> String id = tokens[1];
>> String message = tokens[2];
>> key.set("/" + label + "/" + id);
>> value.set(message);
>> writer.append(key, value);
>>
>> So I'm creating a sequence file of the form <Text,Text> where the key is
>> composed in this way: "/label/documentID" and the value contains the
>> original text of the document.
>>
>> After this step I create tfidf documents using mahout utilities, then I've
>> a sequence file <Text,VectorWritable> like this:
>>
>> Key: /negativo/468437278663409666
>> Value: /negativo/468437278663409666:{143:0.2884088933275849,233:0.2884088933275849,241:0.2772479861583959,309:0.22061363650715415}
>>
>> Then I am using the command on the newly created vector:
>>
>> ./mahout trainnb -i tfidf-vectors -el -li labelindex -o model -ow -c
>>
>> And then:
>>
>> ./mahout testnb -i tfidf-vector -m model -l labelindex -ow -o trainingVectorTest-result -c
>>
>> and this is the output:
>>
>> 14/07/25 15:44:04 INFO test.TestNaiveBayesDriver: Complementary Results:
>> =======================================================
>> Summary
>> -------------------------------------------------------
>> Correctly Classified Instances     :   112    99,115%
>> Incorrectly Classified Instances   :     1     0,885%
>> Total Classified Instances         :   113
>>
>> =======================================================
>> Confusion Matrix
>> -------------------------------------------------------
>> a     b     c     <--Classified as
>> 47    0     0     |  47    a = negativo
>> 0     41    0     |  41    b = neutrale
>> 0     1     24    |  25    c = positivo
>>
>> =======================================================
>> Statistics
>> -------------------------------------------------------
>> Kappa                               0,9361
>> Accuracy                            99,115%
>> Reliability                         74%
>> Reliability (standard deviation)    0,4937
>>
>> What I want to do now is to use the classifier on a new dataset that is
>> unlabeled, so I've a csv like this:
>>
>> 471685156584292353,@beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!
>>
>> So I wrote a sequence file with:
>>
>> key= /documentid/ value= Content of the document
>>
>> and then use mahout utilities to create a tfidf-vector:
>>
>> Key: /471685156584292353/
>> Value:/471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086
>> ...
>>
>> But when I use the command testnb on this new dataset I get this exception:
>>
>> java.lang.IllegalArgumentException: Label not found: 471685156584292353
>>
>> I know that this is due to the fact that the documentID is recognized as
>> a label, but I don't know how to resolve that. It would be great if you could
>> provide me a similar example, because I can't find anything similar.
>>
>> Thank you so much in advance, your help is really appreciated.
>>
>> Luca Filipponi.
>>
>> On 29 Jul 2014, at 16:43, vaibhav srivastava <[email protected]> wrote:
>>
>>> Hi
>>> The sequence file format will be Text and VectorWritable.
>>> Suppose you have test documents named 1,2,3,4.
>>> Then you can have the sequence file format as Key : /test/1 Value : <vectors1>
>>> /test/2 Value : <vectors2>
>>>
>>> this line in BayesTestMapper
>>> //the key is the expected value
>>>
>>> context.write(new Text(SLASH.split(key.toString())[1]), new VectorWritable(result));
>>>
>>> and TestNaiveBayesDriver.java might help you. If you remove this part from
>>> this code you will not get the confusion matrix and initial labels are not
>>> required.
>>>
>>> if (bestIdx != Integer.MIN_VALUE) {
>>>     ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
>>>     analyzer.addInstance(pair.getFirst().toString(), classifierResult);
>>> }
>>>
>>> Your output file will contain your document name, suppose 1, and the label vector
>>> with its values.
>>>
>>> Hope this helps.
>>>
>>> Thanks,
>>>
>>> Vaibhav
>>>
>>> [email protected]
>>>
>>> On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi <[email protected]> wrote:
>>>
>>>> I am using mahout 0.9, which part of the source code should I look at?
>>>>
>>>> My problem is that I don't know how the sequence file without the label
>>>> should be structured.
>>>>
>>>> Do you have any hint?
>>>>
>>>> On 29 Jul 2014, at 15:24, vaibhav srivastava <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>> If you want to create a test set and you do not want to measure accuracy,
>>>>> then you can make an instance of the classifier, load your model on that
>>>>> classifier, and then find the best score.
>>>>> Look at the naive Bayes test code.
>>>>> Hope this helps. Thanks.
>>>>> On 29 Jul 2014 12:53, "Luca Filipponi" <[email protected]> wrote:
>>>>>
>>>>>> Hi, I am trying to develop sentiment analysis on Italian tweets from
>>>>>> Twitter using the naive Bayes classifier, but I've some trouble.
>>>>>>
>>>>>> My idea was to classify a lot of tweets as positive, negative or neutral,
>>>>>> and use that as the training set for the classifier. To do that I've written a
>>>>>> sequence file, in the format <Text,Text>, where in the key there is
>>>>>> /label/tweetID and in the value the text, and then the text of all the
>>>>>> dataset is converted into tfidf vectors, using mahout utilities.
>>>>>>
>>>>>> Then I'm using the commands:
>>>>>>
>>>>>> ./mahout trainnb and ./mahout testnb to check the classifier, and the
>>>>>> score is right (I've got nearly 100% because the test set is the same as
>>>>>> the train set)
>>>>>>
>>>>>> My question is: if I want to use a test set that is unlabeled, how should it
>>>>>> be created? Because if the format isn't like:
>>>>>>
>>>>>> key = /label/ the classifier can't find the label and I've got an
>>>>>> exception
>>>>>>
>>>>>> but a new dataset will obviously be unlabeled because I need to
>>>>>> classify it, so I don't know what to put in the key of the sequence file.
>>>>>>
>>>>>> I've searched online for some examples, but the only ones that I've found
>>>>>> use the split command on the original dataset, and then test on part of
>>>>>> that, but that isn't my case.
>>>>>>
>>>>>> Every idea for developing a better sentiment analysis is welcome, thanks
>>>>>> in advance for the help.
>>>>>>
>>>
>>> --
>>> Thanks and Regards,
>>> Vaibhav Srivastava
>>> Email-id: [email protected]
>>> Mobile no.: 9552543029
>
> --
> Thanks and Regards,
> Vaibhav Srivastava
> Email-id: [email protected]
> Mobile no.: 9552543029
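
A note on the advice quoted above: once classifyFull returns a score vector, choosing the predicted label comes down to taking the index with the highest score and looking it up in the label index. Below is a minimal sketch of that step in plain Java, without Mahout on the classpath. The class name BestLabel, the score values, and the index-to-label mapping are assumptions made up for illustration; only the Integer.MIN_VALUE sentinel mirrors the snippet quoted from TestNaiveBayesDriver.

```java
import java.util.Map;
import java.util.TreeMap;

public class BestLabel {

    /**
     * Returns the label with the highest score, or null when no score
     * could be selected (for example, an empty score array).
     */
    public static String pick(double[] scores, Map<Integer, String> labelMap) {
        int bestIdx = Integer.MIN_VALUE;          // sentinel: nothing chosen yet
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        return bestIdx == Integer.MIN_VALUE ? null : labelMap.get(bestIdx);
    }

    public static void main(String[] args) {
        // Hypothetical label index; the real one comes from the labelindex file.
        Map<Integer, String> labelMap = new TreeMap<>();
        labelMap.put(0, "negativo");
        labelMap.put(1, "neutrale");
        labelMap.put(2, "positivo");

        // Made-up scores standing in for the vector returned by classifyFull.
        double[] scores = {-12.4, -9.8, -15.1};

        System.out.println(pick(scores, labelMap)); // prints "neutrale"
    }
}
```

Because no expected label is needed for this step, the key of an unlabeled document only has to identify the document.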
