Hi Richard,

On Wed, Jan 25, 2017 at 5:16 PM, Richard Eckart de Castilho <r...@apache.org> wrote:
> Hi all,
>
> I have tried training a model for the OpenNLP lemmatizer on the
> GUM 3.0 corpus. The accuracy of the model on an 80/20 split is
> about .83 (unless I calculated it wrong).

For English? That is very low. English lemmatizer models should be in the high 90s. In fact, I have just trained a perceptron lemmatizer myself, dividing the corpus into the first 50K words for training and the rest for testing. Evaluating it with the LemmatizerEvaluator CLI gives an accuracy of 98.08.

> Now I wanted to use the model to create a simple test case on
> a standard sentence that we use everywhere:
>
> "We need a very complicated example sentence , which contains
> as many constituents and dependencies as possible ."
>
> When I run the lemmatizer on this sentence using my trained
> model, I get odd results like:
>
> * "constituents" is lemmatized as "constraint"
> * "dependency" is lemmatized as "constituency"

With the model I just trained, these two words are lemmatized as expected ("constituent" and "dependency"), so I guess something went wrong in your training process.

The corpus format is: word, POS tag, lemma (tab-separated). I formatted GUM from the dependency-formatted version at

https://github.com/amir-zeldes/gum/tree/master/dep

taking columns 2, 5 and 3.

HTH,
Rodrigo
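P.S. In case it helps, the column extraction described above can be done with a small awk script. This is only a sketch: the input glob `gum/dep/*.conllu` and the output file name are placeholders, and it assumes the dep files are tab-separated CoNLL-style, with the word form in column 2, the lemma in column 3 and the POS tag in column 5, comment lines starting with `#`, and blank lines separating sentences.

```shell
# Convert CoNLL-style dependency files to the lemmatizer training
# format: word<TAB>pos<TAB>lemma, one token per line, blank line
# between sentences.
# Assumed column order: FORM = $2, LEMMA = $3, POS = $5.
awk -F'\t' 'BEGIN { OFS = "\t" }
  /^#/    { next }              # skip comment lines
  NF == 0 { print ""; next }    # preserve sentence boundaries
  { print $2, $5, $3 }' gum/dep/*.conllu > gum-lemmatizer.train
```

The resulting file should then be usable as training/test data for the lemmatizer trainer and the LemmatizerEvaluator CLI mentioned above.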