Hi, I am trying to build a sentiment analysis system for Italian tweets from 
Twitter using the naive Bayes classifier, but I am having some trouble.

My idea was to classify a large number of tweets as positive, negative, or 
neutral and use them as the training set for the classifier. To do that I wrote 
a sequence file in the format <Text,Text>, where the key is /label/tweetID and 
the value is the tweet text. The text of the whole dataset is then converted to 
TF-IDF vectors using the Mahout utilities.
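
To make the record layout concrete, here is a minimal Python sketch of how the 
keys are built and how a label can be read back from them. It is not Mahout 
code: the helpers make_key and parse_label and the sample tweets are made up 
for illustration.

```python
# Sketch of the <Text,Text> record layout described above.
# make_key and parse_label are hypothetical helpers, not Mahout functions.

def make_key(label, tweet_id):
    """Build a key of the form /label/tweetID."""
    return "/{}/{}".format(label, tweet_id)


def parse_label(key):
    """Return the label segment of a /label/tweetID key."""
    parts = key.split("/")
    # A well-formed key splits into ['', label, tweetID].
    if len(parts) < 3 or parts[0] != "" or not parts[1]:
        raise ValueError("key is not in /label/tweetID form: " + repr(key))
    return parts[1]


# (key, value) pairs as they would be written to the sequence file.
records = [
    (make_key("positive", "1001"), "che bella giornata"),
    (make_key("negative", "1002"), "che brutto servizio"),
    (make_key("neutral", "1003"), "oggi piove a Roma"),
]

labels = [parse_label(key) for key, _ in records]
```

A key without the /label/ segment, such as a bare tweet ID, makes parse_label 
raise a ValueError in this sketch.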

Then I run the commands:

./mahout trainnb and ./mahout testnb to check the classifier, and the score is 
as expected (I get nearly 100% because the test set is the same as the 
training set).

My question is: if I want to use a test set that is unlabeled, how should it be 
created? If the key is not in the format

key = /label/

the classifier can't find the label and I get an exception.

A new dataset will obviously be unlabeled, because classifying it is the whole 
point, so I don't know what to put in the key of the sequence file.
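
What I am after at prediction time can be sketched in plain Python. This is a 
toy multinomial naive Bayes on raw word counts with add-one smoothing, not 
Mahout's implementation; the names train and classify, the alpha parameter, and 
the sample tweets are all assumptions made for illustration. The point is that 
scoring a tweet needs only its text, so the key could carry just the tweet ID.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs, alpha=1.0):
    """Fit a toy multinomial naive Bayes model from (label, text) pairs."""
    doc_counts = Counter()              # number of documents per label
    word_counts = defaultdict(Counter)  # token counts per label
    vocab = set()
    for label, text in labeled_docs:
        tokens = text.lower().split()
        doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"doc_counts": doc_counts, "word_counts": word_counts,
            "vocab": vocab, "alpha": alpha}


def classify(model, text):
    """Return the most probable label for text; no true label is required."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    alpha = model["alpha"]
    best_label, best_score = None, float("-inf")
    for label in sorted(model["doc_counts"]):
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + alpha * vocab_size
        # Log prior plus smoothed log likelihood of each known token.
        score = math.log(model["doc_counts"][label] / total_docs)
        for token in text.lower().split():
            if token in model["vocab"]:  # tokens never seen in training are skipped
                score += math.log((counts[token] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training = [
    ("positive", "che bella giornata"),
    ("positive", "film bellissimo davvero"),
    ("negative", "che brutto servizio"),
    ("negative", "film pessimo davvero"),
]
model = train(training)

# Unlabeled tweets keyed by tweet ID only.
unlabeled = {"2001": "giornata bella", "2002": "servizio pessimo"}
predictions = {tweet_id: classify(model, text)
               for tweet_id, text in unlabeled.items()}
```

With this toy data, tweet 2001 comes out as positive and tweet 2002 as 
negative.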

I've searched online for examples, but the only ones I found use the split 
command on the original dataset and then test on part of it, which isn't my 
case.


Any ideas for improving the sentiment analysis are welcome. Thanks in advance 
for the help.
