Hi All
Last week, I took part in a hackathon for Alfresco, the open source
content management system, and as part of that we had a play with
integrating sentiment analysis [1]. As Stanford CoreNLP has sentiment
analysis built in, we used that first. Then I tried Apache OpenNLP
instead. This wasn't that easy, but it ended up working better for our
test documents.
I figured it might be good to share my experiences, in case there are
things I could improve, or in case there's documentation / examples / etc
that could be improved!
So, first up, the approach. I couldn't find anything in the docs on
sentiment analysis. So, I decided to try using the Document Categorizer,
and feed it two categories to learn/predict on, positive and negative. Is
that the best route?
(I did find a 2016 GSOC project to add sentiment analysis, but decided to
stick with just core OpenNLP code)
Next, I hit a snag - the code at
https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat
doesn't compile against 1.9.1. I've raised
https://issues.apache.org/jira/browse/OPENNLP-1237 for this.
Having guessed at the new API syntax, I then needed to feed in some
training data. Based on [2], I opted for the JHU Amazon review data [3].
Are there better free datasets for English-language sentiment?
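In case it helps anyone hit the same compile error, the training call I ended up guessing at for 1.9.1 looks roughly like this. It's a sketch, not gospel: the file name is a placeholder, and it assumes the manual's default one-sample-per-line format (category, whitespace, then the document text).

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSentiment {
    public static void main(String[] args) throws Exception {
        // Each training line looks like: "positive This book was great"
        InputStreamFactory dataIn =
            new MarkableFileInputStreamFactory(new File("sentiment.train"));
        ObjectStream<String> lines =
            new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        DoccatModel model = DocumentCategorizerME.train(
            "en", samples, TrainingParameters.defaultParams(),
            new DoccatFactory());
    }
}
```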
Next snag - the data format. The JHU data isn't in the format that the
training tool or PlainTextByLineStream expects. What's more, I couldn't
find any examples in the manual of supplying an alternative to
DocumentSampleStream, i.e. a custom ObjectStream<DocumentSample>. Is
there one? Is there anything else on writing your own? Should there be?
(I ended up writing one [4] in Groovy, which I'm fairly sure is non-ideal,
and probably could be much improved, suggestions welcome!)
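For what it's worth, the parsing half is simple enough even in plain Java. Something like the below pulls the review bodies out of the JHU files' pseudo-XML (the <review_text> tag name is how I read the dataset, so check your copy); a custom ObjectStream<DocumentSample> would then just pair each body with its "positive" or "negative" category in read().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts review bodies from the JHU pseudo-XML review files, where each
// review's text sits between <review_text> and </review_text> tags.
public class JhuReviewParser {
    private static final Pattern REVIEW_TEXT =
        Pattern.compile("<review_text>(.*?)</review_text>", Pattern.DOTALL);

    public static List<String> extractReviewTexts(String fileContents) {
        List<String> texts = new ArrayList<>();
        Matcher m = REVIEW_TEXT.matcher(fileContents);
        while (m.find()) {
            // Collapse internal whitespace so each review becomes one line
            texts.add(m.group(1).trim().replaceAll("\\s+", " "));
        }
        return texts;
    }
}
```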
Next challenge - TrainingParameters. Several blog posts I found on using
the DoccatFactory suggested a cutoff of 2 and iterations of 30. I couldn't
spot anything in the manual under Document Categorizer for parameters,
though other sections did have them. Did I miss it? Should there be
something in the manual?
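For the record, the settings those blog posts suggested translate to a fragment like this (constants are from opennlp.tools.util.TrainingParameters, and the values are just what the posts recommended, not something I tuned):

```java
TrainingParameters params = new TrainingParameters();
// Ignore features seen fewer than 2 times in the training data
params.put(TrainingParameters.CUTOFF_PARAM, "2");
// Run 30 training iterations
params.put(TrainingParameters.ITERATIONS_PARAM, "30");
```

That params object then goes into the DocumentCategorizerME.train call in place of the defaults.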
Building the model was nice and quick, and getting predictions was easy
too, which was good! However, with my (quite possibly wrong) plan of training
for two categories, Positive or Negative, I wasn't able to see how to get
a good "how much sentiment" out. I opted for just returning whichever
category was reported as best, with no score (since typically the two
categories came back with very similar scores, though one generally
slightly higher than the other). Is there a better way?
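To make that concrete, here's roughly what I mean. This assumes the model was trained with "positive" and "negative" as the category labels, and the gap-between-scores idea at the end is only my guess at a strength measure, not something from the docs:

```java
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class Predict {
    public static String classify(DoccatModel model, String text) {
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(text);
        double[] outcomes = categorizer.categorize(tokens);

        // The two scores typically come back very close together, so
        // just report whichever category scored highest
        String best = categorizer.getBestCategory(outcomes);

        // A possible "how much sentiment" signal: the gap between the
        // two class probabilities
        double gap = Math.abs(outcomes[categorizer.getIndex("positive")]
                            - outcomes[categorizer.getIndex("negative")]);
        return best + " (margin " + gap + ")";
    }
}
```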
Finally, it did all work, and for our testing it did better than Stanford
CoreNLP, so thanks everyone for the library :)
Thanks
Nick
[1] https://github.com/Alfresco/SentimentAnalysis
[2]
https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
[3] http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
[4]
https://github.com/Alfresco/SentimentAnalysis/blob/master/sentiment-analysis/src/main/groovy/JHUSentimentReader.groovy