Hi Richard,

On Wed, Jan 25, 2017 at 5:16 PM, Richard Eckart de Castilho <r...@apache.org>
wrote:

> Hi all,
>
> I have tried training a model for the OpenNLP lemmatizer on the
> GUM 3.0 corpus. The accuracy of the model on a 80/20 split is
> about .83 (unless I calculated it wrong).
>

For English? That is very low. English lemmatizer models should be in the
high 90s.
In fact, I have just trained a perceptron lemmatizer myself, splitting the
corpus into the first 50K words for training and the rest for testing.
Evaluating it with the LemmatizerEvaluator CLI gives 98.08% accuracy.


>
> Now I wanted to use the model to create a simple test case on
> a standard sentence that we use everywhere:
>
> "We need a very complicated example sentence , which contains
> as many constituents and dependencies as possible ."
>
> When I run the lemmatizer on this sentence using my trained
> model, I get odd results like:
>
> * "constituents" is lemmatized as "constraint"
> * "dependency"   is lemmatized as "constituency"
>

When I run the OpenNLP model I just trained, these two words are lemmatized
as expected (constituent and dependency), so I guess something went wrong in
your training process.

Corpus format: word pos lemma (tab-separated)

I formatted GUM from the dependency-formatted version:

https://github.com/amir-zeldes/gum/tree/master/dep

I took columns 2, 5 and 3 (word form, POS tag and lemma).

HTH,

Rodrigo
