I appreciate your help, but because of my lack of knowledge I didn't understand.
I'll try to explain my problem better :D

What I've done is create a sequence file starting from a CSV like this (these are Italian tweets :D):

    negativo,471685156584292353, @beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!
    positivo,471685170698149888,RT @carlucci_cc: @valy_s renzie si preoccupa di chi gli garantisce voti...ma stanno scoprendo il prezzo di quei fottutissimi #80euro dagli …
    neutrale,471685174426886144,Di #elezioni, di venditori di fumo e di altre schifezze... http://t.co/euFbtP7hQ1 … #Europee2014 via

I create the sequence file in this way:

    String[] tokens = line.split(",", 3);
    String label = tokens[0];
    String id = tokens[1];
    String message = tokens[2];
    key.set("/" + label + "/" + id);
    value.set(message);
    writer.append(key, value);

So I'm creating a sequence file of the form <Text,Text>, where the key has the form "/label/documentID" and the value contains the original text of the document.

After this step I create the tf-idf vectors using the Mahout utilities, which gives me a sequence file <Text,VectorWritable> like this:

    Key: /negativo/468437278663409666
    Value: /negativo/468437278663409666:{143:0.2884088933275849,233:0.2884088933275849,241:0.2772479861583959,309:0.22061363650715415}

Then I run this command on the newly created vectors:

    ./mahout trainnb -i tfidf-vectors -el -li labelindex -o model -ow -c

And then:

    ./mahout testnb -i tfidf-vector -m model -l labelindex -ow -o trainingVectorTest-result -c

And this is the output:

    14/07/25 15:44:04 INFO test.TestNaiveBayesDriver: Complementary Results:
    =======================================================
    Summary
    -------------------------------------------------------
    Correctly Classified Instances          :        112     99,115%
    Incorrectly Classified Instances        :          1      0,885%
    Total Classified Instances              :        113

    =======================================================
    Confusion Matrix
    -------------------------------------------------------
    a     b     c     <--Classified as
    47    0     0      |  47    a = negativo
    0     41    0      |  41    b = neutrale
    0     1     24     |  25    c = positivo

    =======================================================
    Statistics
    -------------------------------------------------------
    Kappa                                       0,9361
    Accuracy                                   99,115%
    Reliability                                    74%
    Reliability (standard deviation)            0,4937

What I want to do now is use the classifier on a new dataset that is unlabeled, so I have a CSV like this:

    471685156584292353,@beppe_grillo intanto .. Piangi tu ... Per adesso io rido !!!!!

So I wrote a sequence file with:

    key   = /documentid/
    value = content of the document

and then used the Mahout utilities to create the tf-idf vectors:

    Key: /471685156584292353/
    Value: /471685156584292353/:{1:0.19424138174284086,24:0.19424138174284086,25:0.1810660431557166,44:0.19424138174284086,78:0.19424138174284086 ...

But when I run testnb on this new dataset I get this exception:

    java.lang.IllegalArgumentException: Label not found: 471685156584292353

I know this happens because the document ID is recognized as a label, but I don't know how to resolve that. It would be great if you could provide a similar example, because I can't find anything similar.

Thank you so much in advance, your help is really appreciated.

Luca Filipponi.

On 29 Jul 2014, at 16:43, vaibhav srivastava <[email protected]> wrote:

> Hi
> The sequence file format will be Text and VectorWritable.
> Suppose you have test documents named 1,2,3,4.
> Then you can have the sequence file format as
> Key : /test/1 Value : <vectors1>
>       /test/2 Value : <vectors2>
>
> This line in BayesTestMapper
> //the key is the expected value
>
> context.write(new Text(SLASH.split(key.toString())[1]), new
> VectorWritable(result));
>
>
> and TestNaiveBayesDriver.java might help you. If you remove this part from
> this code you will not get the confusion matrix and initial labels are not
> required.
>
> if (bestIdx != Integer.MIN_VALUE) {
>   ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
>   analyzer.addInstance(pair.getFirst().toString(), classifierResult);
> }
>
> Your output file will contain your document name (say, 1) and the label vector
> with its values.
>
> Hope this helps.
>
> Thanks,
> Vaibhav
> [email protected]
>
> On Tue, Jul 29, 2014 at 7:16 PM, Luca Filipponi <[email protected]>
> wrote:
>
>> I am using Mahout 0.9; which part of the source code should I look at?
>>
>> My problem is that I don't know how the sequence file without the label
>> should be structured.
>>
>> Do you have any hint?
>>
>> On 29 Jul 2014, at 15:24, vaibhav srivastava <[email protected]> wrote:
>>
>>> Hi,
>>> If you want to create a test set and you do not want to measure accuracy,
>>> then you can make an instance of the classifier, load your model on that
>>> classifier, and then find the best score.
>>> Look at the naive Bayes test code.
>>> Hope this helps. Thanks.
>>> On 29 Jul 2014 12:53, "Luca Filipponi" <[email protected]> wrote:
>>>
>>>> Hi, I am trying to develop sentiment analysis on Italian tweets from
>>>> Twitter using the naive Bayes classifier, but I've had some trouble.
>>>>
>>>> My idea was to classify a lot of tweets as positive, negative or neutral,
>>>> and use that as the training set for the classifier. To do that I've
>>>> written a sequence file in the format <Text,Text>, where the key is
>>>> /label/tweetID and the value is the text, and then the text of the whole
>>>> dataset is converted to tf-idf vectors using the Mahout utilities.
>>>>
>>>> Then I'm using the commands:
>>>>
>>>> ./mahout trainnb and ./mahout testnb to check the classifier, and the
>>>> score is right (I've got nearly 100% because the test set is the same as
>>>> the train set)
>>>>
>>>> My question is: if I want to use a test set that is unlabeled, how should
>>>> it be created? Because if the format isn't like:
>>>>
>>>> key = /label/ the classifier can't find the label and I get an
>>>> exception
>>>>
>>>> but a new dataset will obviously be unlabeled because I need to
>>>> classify it, so I don't know what to put in the key of the sequence file.
>>>>
>>>> I've searched online for examples, but the only ones I've found
>>>> use the split command on the original dataset and then test on part of
>>>> that, which isn't my case.
>>>>
>>>>
>>>> Any idea for developing a better sentiment analysis is welcome, thanks
>>>> in advance for the help.
>>>>
>>>>
>>
>>
>
>
> --
> Thanks and Regards,
> Vaibhav Srivastava
> Email-id: [email protected]
> Mobile no.: 9552543029
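
To illustrate the two code fragments quoted in this thread, here is a minimal plain-Java sketch that does not use any Mahout classes. It assumes SLASH is a pattern matching "/" (the thread does not show its definition), and KeyLabelSketch, expectedLabel and bestLabel are hypothetical names chosen for illustration. The first method mirrors the quoted mapper line to show why a key such as /471685156584292353/ makes the document ID come out as the expected label. The second repeats the bestIdx/bestScore idea by taking the highest-scoring index from an array of per-label scores. The label-index order and the scores in main are made-up values, not real classifier output.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only; it is not part of Mahout.
class KeyLabelSketch {
    // Assumption: the SLASH constant in the quoted mapper line matches "/".
    private static final Pattern SLASH = Pattern.compile("/");

    // Mirrors SLASH.split(key.toString())[1]: the second "/"-separated field
    // of the key is treated as the expected label.
    static String expectedLabel(String key) {
        return SLASH.split(key)[1];
    }

    // Returns the label whose score is highest, or null when no score beats
    // negative infinity (for example, when the array is empty).
    static String bestLabel(double[] scores, Map<Integer, String> labelMap) {
        int bestIdx = Integer.MIN_VALUE;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        return bestIdx == Integer.MIN_VALUE ? null : labelMap.get(bestIdx);
    }

    public static void main(String[] args) {
        // Labeled key: the second field is a real label.
        System.out.println(expectedLabel("/negativo/468437278663409666"));
        // Unlabeled key: the second field is the document ID, which is then
        // looked up as if it were a label.
        System.out.println(expectedLabel("/471685156584292353/"));

        // Assumed label index (0, 1, 2) and made-up scores.
        Map<Integer, String> labelMap = new HashMap<>();
        labelMap.put(0, "negativo");
        labelMap.put(1, "neutrale");
        labelMap.put(2, "positivo");
        double[] scores = {-12.4, -9.8, -15.1};
        System.out.println(bestLabel(scores, labelMap));
    }
}
```

Under these assumptions, picking the best-scoring label needs no expected label at all. That matches the suggestion in the quoted reply: the document ID can stay in the key as long as the step that compares the result against an expected label (the analyzer call) is skipped.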
