Hi All
Last week, I took part in a hackathon for Alfresco, the open source
content management system, and as part of that we had a play with
integrating sentiment analysis [1]. As Stanford CoreNLP has sentiment
analysis built in, we used that first. Then I tried Apache OpenNLP
instead. This wasn't that easy, but it ended up working better for our
test documents.
I figured it might be good to share my experiences, in case there are
things I could improve, or in case there's documentation / examples / etc
that could be improved!
So, first up, the approach. I couldn't find anything in the docs on
sentiment analysis. So, I decided to try using the Document Categorizer,
and feed it two categories to learn/predict on, positive and negative. Is
that the best route?
(I did find a 2016 GSOC project to add sentiment analysis, but decided to
stick with just core OpenNLP code)
Next, I hit a snag - the code at
https://opennlp.apache.org/docs/1.9.1/manual/opennlp.html#tools.doccat
doesn't compile against 1.9.1. I've raised
https://issues.apache.org/jira/browse/OPENNLP-1237 for this.
Having guessed at the new API syntax, I then needed to feed in some
training data. Based on [2], I opted for the JHU Amazon review data [3].
Are there better free datasets for English-language sentiment?
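In case it helps anyone hit the same compile error, the training call I ended up guessing at for 1.9.1 looks roughly like this. It's a sketch, not gospel: the file name is a placeholder, and it assumes the manual's default one-sample-per-line format (category, whitespace, then the document text).

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainSentiment {
    public static void main(String[] args) throws Exception {
        // Each training line looks like: "positive This book was great"
        InputStreamFactory dataIn =
            new MarkableFileInputStreamFactory(new File("sentiment.train"));
        ObjectStream<String> lines =
            new PlainTextByLineStream(dataIn, StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        DoccatModel model = DocumentCategorizerME.train(
            "en", samples, TrainingParameters.defaultParams(),
            new DoccatFactory());
    }
}
```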
Next snag - the data format. The JHU data isn't in the format that the
training tool or PlainTextByLineStream expects. What's more, I couldn't
find any examples in the manual of supplying an alternative to
DocumentSampleStream, i.e. a custom ObjectStream<DocumentSample>. Is
there one? Is there anything else on writing your own? Should there be?
(I ended up writing one [4] in Groovy, which I'm fairly sure is non-ideal,
and probably could be much improved, suggestions welcome!)
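For what it's worth, the parsing half is simple enough even in plain Java. Something like the below pulls the review bodies out of the JHU files' pseudo-XML (the <review_text> tag name is how I read the dataset, so check your copy); a custom ObjectStream<DocumentSample> would then just pair each body with its "positive" or "negative" category in read().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Extracts review bodies from the JHU pseudo-XML review files, where each
// review's text sits between <review_text> and </review_text> tags.
public class JhuReviewParser {
    private static final Pattern REVIEW_TEXT =
        Pattern.compile("<review_text>(.*?)</review_text>", Pattern.DOTALL);

    public static List<String> extractReviewTexts(String fileContents) {
        List<String> texts = new ArrayList<>();
        Matcher m = REVIEW_TEXT.matcher(fileContents);
        while (m.find()) {
            // Collapse internal whitespace so each review becomes one line
            texts.add(m.group(1).trim().replaceAll("\\s+", " "));
        }
        return texts;
    }
}
```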
Next challenge - TrainingParameters. Several blog posts I found on using
the DoccatFactory suggested a cutoff of 2 and iterations of 30. I couldn't
spot anything in the manual under Document Categorizer for parameters,
though other sections did have them. Did I miss it? Should there be
something in the manual?
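For the record, the settings those blog posts suggested translate to a fragment like this (constants are from opennlp.tools.util.TrainingParameters, and the values are just what the posts recommended, not something I tuned):

```java
TrainingParameters params = new TrainingParameters();
// Ignore features seen fewer than 2 times in the training data
params.put(TrainingParameters.CUTOFF_PARAM, "2");
// Run 30 training iterations
params.put(TrainingParameters.ITERATIONS_PARAM, "30");
```

That params object then goes into the DocumentCategorizerME.train call in place of the defaults.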
Building the model was nice and quick, and getting predictions was easy
too, which was good! However, with my (quite possibly wrong) plan of training
for two categories, Positive or Negative, I wasn't able to see how to get
a good "how much sentiment" out. I opted for just returning whichever
category was reported as best, with no score (since typically the two
categories came back with very similar scores, though one generally
slightly higher than the other). Is there a better way?
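To make that concrete, here's roughly what I mean. This assumes the model was trained with "positive" and "negative" as the category labels, and the gap-between-scores idea at the end is only my guess at a strength measure, not something from the docs:

```java
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.tokenize.WhitespaceTokenizer;

public class Predict {
    public static String classify(DoccatModel model, String text) {
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);
        String[] tokens = WhitespaceTokenizer.INSTANCE.tokenize(text);
        double[] outcomes = categorizer.categorize(tokens);

        // The two scores typically come back very close together, so
        // just report whichever category scored highest
        String best = categorizer.getBestCategory(outcomes);

        // A possible "how much sentiment" signal: the gap between the
        // two class probabilities
        double gap = Math.abs(outcomes[categorizer.getIndex("positive")]
                            - outcomes[categorizer.getIndex("negative")]);
        return best + " (margin " + gap + ")";
    }
}
```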
Finally, it did all work, and for our testing it did better than Stanford
CoreNLP, so thanks everyone for the library :)
Thanks
Nick
[1] https://github.com/Alfresco/SentimentAnalysis
[2]
https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124
[3] http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
[4]
https://github.com/Alfresco/SentimentAnalysis/blob/master/sentiment-analysis/src/main/groovy/JHUSentimentReader.groovy