Hi all, I have tried training a model for the OpenNLP lemmatizer on the GUM 3.0 corpus. The accuracy of the model on an 80/20 split is about .83 (unless I calculated it wrong).
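For reference, this is essentially how I computed that figure: token-level accuracy over the held-out 20%, i.e. the fraction of tokens whose predicted lemma exactly matches the gold lemma. A minimal sketch (the `lemma_accuracy` helper and the example lists are just illustrations, not my actual evaluation data):

```python
# Token-level lemma accuracy: exact string match against the gold lemma.
def lemma_accuracy(gold_lemmas, predicted_lemmas):
    assert len(gold_lemmas) == len(predicted_lemmas)
    correct = sum(1 for g, p in zip(gold_lemmas, predicted_lemmas) if g == p)
    return correct / len(gold_lemmas)

# Toy example: one mismatch out of seven tokens.
gold = ["we", "need", "a", "very", "complicated", "example", "sentence"]
pred = ["we", "need", "a", "very", "complicate", "example", "sentence"]
print(round(lemma_accuracy(gold, pred), 2))  # 6/7 correct -> 0.86
```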
Now I wanted to use the model in a simple test case on a standard sentence that we use everywhere: "We need a very complicated example sentence , which contains as many constituents and dependencies as possible ." When I run the lemmatizer on this sentence using my trained model, I get odd results like:

* "constituents" is lemmatized as "constraint"
* "dependency" is lemmatized as "constituency"

The words "constituents" and "dependency" do not occur in the GUM corpus; "constraint" and "constituency", however, do. By contrast, the IXA Lemmatizer using the English CoNLL 2009 model produces sensible results.

My understanding is that the OpenNLP Lemmatizer is basically the IXA Lemmatizer contributed to OpenNLP - is that correct? Does it train on full tokens? Could it be that I am doing something wrong during the training, or is the lemmatizer really very sensitive to out-of-dictionary words?

Cheers,

-- Richard