I am using Mahout 0.9. Which part of the source code should I look at? My problem is that I don't know how the sequence file should be structured when there is no label.
Do you have any hints?

On 29 Jul 2014, at 15:24, vaibhav srivastava <[email protected]> wrote:

> Hi,
> If you want to create a test set and do not want to measure accuracy,
> you can create an instance of the classifier, load your model into that
> classifier, and then find the best score.
> Look at the naive Bayes test code.
> Hope this helps. Thanks.
> On 29 Jul 2014 12:53, "Luca Filipponi" <[email protected]> wrote:
>
>> Hi, I am trying to develop sentiment analysis on Italian tweets from
>> Twitter using the naive Bayes classifier, but I'm having some trouble.
>>
>> My idea was to classify a lot of tweets as positive, negative or neutral,
>> and use them as the training set for the classifier. To do that I've written a
>> sequence file in the format <Text,Text>, where the key is
>> /label/tweetID and the value is the text, and then the text of the whole
>> dataset is converted into tf-idf vectors using the Mahout utilities.
>>
>> Then I'm using the commands:
>>
>> ./mahout trainnb and ./mahout testnb to check the classifier, and the
>> score is right (I get nearly 100% because the test set is the same as
>> the training set).
>>
>> My question is: if I want to use a test set that is unlabeled, how should it
>> be created? Because if the format isn't like:
>>
>> key = /label/ the classifier can't find the label and I get an
>> exception.
>>
>> But a new dataset will obviously be unlabeled, because I need to
>> classify it, so I don't know what to put in the key of the sequence file.
>>
>> I've searched online for examples, but the only ones I've found
>> use the split command on the original dataset and then test on part of
>> it, which isn't my case.
>>
>> Any ideas for developing a better sentiment analysis are welcome. Thanks
>> in advance for the help.
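
The suggestion quoted above (load the model into a classifier, score the document against every label, keep the best score) can be illustrated with a minimal sketch in plain Java. It does not call Mahout at all: the class name, the toy weights, the tweet IDs and the simple dot-product scoring are assumptions made up for illustration, not Mahout's actual weighting or API. The point it tries to show is that when your own code does the classification, the key of an unlabeled record only has to identify the tweet, so it needs no /label/ prefix.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: assign the best-scoring label to unlabeled tf-idf vectors. */
public class UnlabeledClassifySketch {

    // Score of one label: dot product of the document vector with that
    // label's per-term weights (a simplification, not Mahout's formula).
    static double score(double[] labelWeights, double[] docVector) {
        double sum = 0.0;
        for (int i = 0; i < docVector.length; i++) {
            sum += labelWeights[i] * docVector[i];
        }
        return sum;
    }

    // Returns the label whose score is highest for this document.
    static String classify(Map<String, double[]> model, double[] docVector) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : model.entrySet()) {
            double s = score(e.getValue(), docVector);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy "model": made-up log-weights over a three-term vocabulary.
        Map<String, double[]> model = new LinkedHashMap<>();
        model.put("positive", new double[] {-0.5, -2.0, -1.5});
        model.put("negative", new double[] {-2.0, -0.5, -1.5});
        model.put("neutral",  new double[] {-1.5, -1.5, -0.7});

        // Unlabeled input: the key is only a (hypothetical) tweet ID,
        // the value stands in for the tweet's tf-idf vector.
        Map<String, double[]> unlabeled = new LinkedHashMap<>();
        unlabeled.put("tweet-001", new double[] {1.2, 0.0, 0.3});
        unlabeled.put("tweet-002", new double[] {0.0, 0.9, 0.1});

        for (Map.Entry<String, double[]> e : unlabeled.entrySet()) {
            System.out.println(e.getKey() + "\t" + classify(model, e.getValue()));
        }
    }
}
```

With a real Mahout model, the scores would come from the trained classifier rather than from a hand-written map; the naive Bayes test code mentioned in the quoted reply is the place to check how the model is loaded and how the per-label scores are obtained in 0.9.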
