Hi All

Last week, I took part in a hackathon for Alfresco, the open source content management system, and as part of that we had a play with integrating Sentiment Analysis [1]. As Stanford CoreNLP has sentiment analysis built in, we used that first. Then I tried to use Apache OpenNLP instead. This wasn't that easy, but it ended up working better for our test documents.

I figured it might be good to share my experiences, in case there are things I could have done better, or in case there's documentation / examples / etc. that could be improved!


So, first up, the approach. I couldn't find anything in the docs on sentiment analysis, so I decided to try the Document Categorizer and feed it two categories to learn/predict on, positive and negative. Is that the best route?
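For reference, the training format the Document Categorizer tooling and DocumentSampleStream expect is one document per line, category first, whitespace separated. Made-up example lines:

```
positive Loved this phone, the battery lasts for days
negative Arrived broken and support never replied
```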

(I did find a 2016 GSoC project to add sentiment analysis, but decided to stick with just core OpenNLP code)


Next, I hit a snag - the code at https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat doesn't compile against 1.9.1. I've raised https://issues.apache.org/jira/browse/OPENNLP-1237 for this.
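In case it helps anyone hitting the same wall, this is roughly what I ended up with against the 1.9.1 API ("train.txt" is just a placeholder path, and I may well have missed a nicer way to do it):

```java
import java.io.File;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.*;
import opennlp.tools.util.*;

// Train a two-category (positive/negative) doccat model; train.txt holds
// data in the "category<space>document" one-line-per-document format.
InputStreamFactory isf =
        new MarkableFileInputStreamFactory(new File("train.txt"));
ObjectStream<String> lines =
        new PlainTextByLineStream(isf, StandardCharsets.UTF_8);
ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);
DoccatModel model = DocumentCategorizerME.train("en", samples,
        TrainingParameters.defaultParams(), new DoccatFactory());
```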


Having guessed at the new API syntax, I then needed to feed in some training data. Based on [2], I opted for the JHU Amazon review data [3]. Not sure if there are better free datasets for English-language sentiment?


Next snag - the data format. The JHU data isn't in the format that the training tool or PlainTextByLineStream expects. What's more, I couldn't find any examples in the manual of an alternative DocumentSampleStream input, or of writing an ObjectStream<DocumentSample>. Is there one? Is there anything else on writing your own? Should there be?

(I ended up writing one [4] in Groovy, which I'm fairly sure is non-ideal and could probably be much improved; suggestions welcome!)
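The core of the conversion, stripped of the OpenNLP stream wrapper, is just pulling each review_text block out of the JHU pseudo-XML and flattening it onto one line. A rough self-contained sketch of that idea (class and method names are mine, not from the library):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JhuToDoccat {
    // Pull each <review_text>...</review_text> body out of the JHU
    // pseudo-XML and emit one "category<space>text" line per review,
    // the one-document-per-line format DocumentSampleStream expects.
    static List<String> toDoccatLines(String jhuXml, String category) {
        List<String> lines = new ArrayList<>();
        Matcher m = Pattern
                .compile("<review_text>(.*?)</review_text>", Pattern.DOTALL)
                .matcher(jhuXml);
        while (m.find()) {
            // Collapse internal newlines/whitespace so each review is one line
            String text = m.group(1).trim().replaceAll("\\s+", " ");
            lines.add(category + " " + text);
        }
        return lines;
    }

    public static void main(String[] args) {
        String sample =
            "<review><review_text>Great book,\nloved it</review_text></review>";
        System.out.println(toDoccatLines(sample, "positive").get(0));
    }
}
```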


Next challenge - TrainingParameters. Several blog posts I found on using the DoccatFactory suggested a cutoff of 2 and 30 iterations. I couldn't spot anything about parameters in the manual under Document Categorizer, though other sections did cover them. Did I miss it? Should there be something in the manual?
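For anyone else looking, those values can be set programmatically, which is what I went with (assuming I've understood the constants correctly - the cutoff drops features seen fewer than that many times):

```java
import opennlp.tools.util.TrainingParameters;

// The cutoff / iterations the blog posts suggested for doccat training
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, "30");
params.put(TrainingParameters.CUTOFF_PARAM, "2");
```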


Building the model was nice and quick, and getting predictions was easy too, which was good! However, with my (quite possibly wrong) plan of training on two categories, positive and negative, I couldn't see how to get a good "how much sentiment" score out. I opted for just returning whichever category was reported as best, with no score, since the two categories typically came back with very similar scores, one generally slightly higher than the other. Is there a better way?
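One thing I considered but haven't validated: since DocumentCategorizerME.categorize returns a double[] of normalized probabilities, the margin between the two categories could serve as a rough strength, rather than discarding the scores entirely. A hypothetical helper (the method name and the score scale are my own invention):

```java
public class SentimentScore {
    // Collapse the two doccat outcome probabilities (which sum to 1) into
    // one signed score in [-1, 1]; 0 means the model saw them as equal.
    static double score(double pPositive, double pNegative) {
        return pPositive - pNegative;
    }

    public static void main(String[] args) {
        // e.g. a 0.55 / 0.45 split gives a weakly positive score
        System.out.println(score(0.55, 0.45));
    }
}
```

Whether a small margin from a maxent model is meaningful as "strength of sentiment" is exactly the bit I'm unsure about, hence the question.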


Finally, it did all work, and in our testing it did better than Stanford CoreNLP, so thanks everyone for the library :)

Thanks
Nick

[1] https://github.com/Alfresco/SentimentAnalysis
[2] https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
[3] http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
[4] https://github.com/Alfresco/SentimentAnalysis/blob/master/sentiment-analysis/src/main/groovy/JHUSentimentReader.groovy
