On 25.01.2017, at 18:38, Rodrigo Agerri <rage...@apache.org> wrote:
> 
> When I run the OpenNLP model just trained these two words get lemmatized as
> expected (constituent and dependency) I guess something went wrong in your
> training process.

I figured out what went wrong. I implemented my own LemmaSampleStream and,
inside it, never called getShortestEditScript() to encode the lemmas. I also
didn't decode the model's output explicitly.

The OpenNLP Lemmatizer API doesn't really indicate that these extra steps
are necessary. I remembered doing them when wrapping the original IXA
implementation, but given the API design in OpenNLP, it had appeared to me
that they were no longer necessary with the OpenNLP implementation.

The Lemmatizer interface only has a lemmatize() method; the decode() method
is only available in LemmatizerME. The LemmaSample JavaDoc also gives no
indication that the lemmas need to be encoded.
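For anyone else hitting this: the encoding in question turns each (word form,
lemma) pair into a short edit-script string that the classifier predicts, and
decoding applies that script back to the word form. Below is a minimal,
self-contained sketch of the general idea only - it uses a simplified
"drop n chars, append suffix" script of my own (hypothetical LemmaCodec class),
not OpenNLP's actual getShortestEditScript() format:

```java
// Minimal sketch of lemma encoding/decoding via edit scripts.
// Illustrative only; OpenNLP's real edit scripts are more elaborate.
public class LemmaCodec {

    // Encode a lemma as "D<n>|<suffix>": drop n trailing chars
    // from the word form, then append <suffix>.
    static String encode(String word, String lemma) {
        int common = 0;
        int max = Math.min(word.length(), lemma.length());
        while (common < max && word.charAt(common) == lemma.charAt(common)) {
            common++;
        }
        return "D" + (word.length() - common) + "|" + lemma.substring(common);
    }

    // Decode an edit script back into the lemma.
    static String decode(String word, String script) {
        int sep = script.indexOf('|');
        int drop = Integer.parseInt(script.substring(1, sep));
        return word.substring(0, word.length() - drop) + script.substring(sep + 1);
    }

    public static void main(String[] args) {
        String[][] pairs = {
            {"constituents", "constituent"},
            {"dependencies", "dependency"}
        };
        for (String[] p : pairs) {
            String script = encode(p[0], p[1]);
            System.out.println(p[0] + " -> " + script + " -> " + decode(p[0], script));
        }
    }
}
```

Training on the raw lemmas instead of such scripts is exactly the mistake I
made: the model then has to memorize whole lemmas per word form rather than
learning a small set of reusable transformations.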

IMHO it would be much less confusing to the user if LemmatizerME.train()
did the encoding internally and the lemmatize() method did the decoding
internally.

Anyway, the accuracy is now much better. Thanks for the tip!

-- Richard
