On 25.01.2017, at 18:38, Rodrigo Agerri <rage...@apache.org> wrote:
> When I run the OpenNLP model just trained, these two words get lemmatized as
> expected (constituent and dependency). I guess something went wrong in your
> training process.
I figured out what went wrong. I implemented my own LemmaSampleStream and, inside it, I didn't call getShortestEditScript(). I also didn't decode the output explicitly.

The OpenNLP Lemmatizer API doesn't really indicate that these extra steps are necessary. I remember having done them when wrapping the original IXA implementation, but given the API design in OpenNLP, it had appeared to me that this was no longer necessary with the OpenNLP implementation. The Lemmatizer interface only has a lemmatize() method - the decode() method is only available in LemmatizerME. Also, the LemmaSample JavaDoc doesn't indicate at all that the lemmas need to be encoded.

IMHO it would be much less confusing to the user if LemmatizerME.train() did the encoding internally and if the lemmatize() method did the decoding internally.

Anyway, the accuracy is now much better. Thanks for the tip!

-- Richard
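For anyone hitting the same issue: the point of the encode/decode step is that the lemmatizer does not predict lemmas directly but predicts an edit script that transforms the word form into its lemma, so training samples must carry encoded scripts and predictions must be decoded back. Below is a minimal, self-contained sketch of that idea. It is NOT OpenNLP's actual implementation - the class name LemmaEditScript, the script format ("D<n>|<suffix>"), and both method bodies are illustrative assumptions; in OpenNLP you would use getShortestEditScript() when building samples and decode via LemmatizerME.

```java
// Minimal sketch (not OpenNLP's real code): a lemma is encoded as an edit
// script relative to the word form at training time, and a predicted script
// is decoded back into a lemma at inference time.
public class LemmaEditScript {

    // Encode: keep the common prefix, record how many trailing characters of
    // the word form to delete and which suffix to append.
    // e.g. ("dependencies", "dependency") -> "D3|y"
    static String encode(String form, String lemma) {
        int i = 0;
        while (i < form.length() && i < lemma.length()
                && form.charAt(i) == lemma.charAt(i)) {
            i++;
        }
        return "D" + (form.length() - i) + "|" + lemma.substring(i);
    }

    // Decode: apply the script to the word form to recover the lemma.
    static String decode(String form, String script) {
        int bar = script.indexOf('|');
        int drop = Integer.parseInt(script.substring(1, bar));
        return form.substring(0, form.length() - drop) + script.substring(bar + 1);
    }

    public static void main(String[] args) {
        String script = encode("constituents", "constituent");
        System.out.println(script);                          // D1|
        System.out.println(decode("constituents", script));  // constituent
    }
}
```

Forgetting either half of this round trip (as I did) means the model is trained on raw lemmas or its raw predictions are used as lemmas, which silently produces the garbage output described earlier in this thread.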