Hi all,

I have tried training a model for the OpenNLP lemmatizer on the
GUM 3.0 corpus. The accuracy of the model on an 80/20 split is
about 0.83 (unless I calculated it wrong).
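For concreteness, here is a minimal sketch of how such a token-level
accuracy on an 80/20 split could be computed (illustrative Python only,
not my actual setup; all names are mine):

```python
import random

def split_and_accuracy(pairs, predict, seed=42):
    """pairs: list of (token, gold_lemma); predict: token -> lemma.

    Shuffles, holds out the last 20% as a test set, and returns the
    fraction of test tokens whose predicted lemma matches the gold one.
    """
    rnd = random.Random(seed)
    data = pairs[:]
    rnd.shuffle(data)
    cut = int(len(data) * 0.8)
    test = data[cut:]
    correct = sum(1 for tok, gold in test if predict(tok) == gold)
    return correct / len(test)
```
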

Now I wanted to use the model to create a simple test case on
a standard sentence that we use everywhere:

"We need a very complicated example sentence , which contains
as many constituents and dependencies as possible ."

When I run the lemmatizer on this sentence using my trained
model, I get odd results like:

* "constituents" is lemmatized as "constraint"
* "dependency"   is lemmatized as "constituency"

The words "constituents" and "dependency" do not occur in the
GUM corpus; "constraint" and "constituency", however, do.

By contrast, the IXA Lemmatizer using the English CoNLL 2009 model
produces sensible results.

My understanding is that the OpenNLP Lemmatizer is basically
the IXA Lemmatizer contributed to OpenNLP - is that correct?

Does it train on full tokens? 
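
I ask because it would explain the behavior: if the classifier predicts
whole lemma strings as class labels, an unseen token can only ever
receive a lemma that occurred in the training data, whereas predicting a
suffix-transformation rule generalizes to unseen tokens. A toy sketch of
the rule idea (my own illustration, no claim about OpenNLP internals):

```python
def suffix_rule(form, lemma):
    """Derive a (strip_count, append_suffix) rule from a training pair."""
    i = 0
    # length of the longest common prefix of form and lemma
    while i < min(len(form), len(lemma)) and form[i] == lemma[i]:
        i += 1
    return (len(form) - i, lemma[i:])

def apply_rule(form, rule):
    """Apply a learned rule to a possibly unseen token."""
    strip, append = rule
    return form[:len(form) - strip] + append

# rule learned from "dependencies" -> "dependency": strip "ies", append "y"
rule = suffix_rule("dependencies", "dependency")
# the same rule generalizes to a token never seen in training:
apply_rule("constituencies", rule)  # -> "constituency"
```
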

Could it be that I am doing something wrong during training,
or is the lemmatizer really that sensitive to out-of-dictionary
words?

Cheers,

-- Richard
