Re: Corpora used for training OpenNLP english models

Jörn Kottmann Mon, 10 Nov 2014 00:41:34 -0800

On 11/05/2014 08:14 AM, Rodrigo Agerri wrote:

Hi Raj,


I believe that the NameFinder models were trained with MUC, but I am
not sure. In any case, if you are going to annotate a different domain
from that of MUC, you will be better off annotating data for that domain
because supervised approaches do not adapt well when used in other
genres/domains.

The English name finder models are trained on MUC 6 / 7 plus some corrections
to solve certain detection problems.

I suggest not using MUC anymore because it is quite dated.

If you want to train name finder models that perform well, I suggest having
a look at OntoNotes 4.0. We have support for training OpenNLP models directly
on it.

The data is not free; we had to pay around 50 USD to get it.

There is now also a newer version 5.0:
https://catalog.ldc.upenn.edu/LDC2013T19

I guess its format didn't change too much, so there is a good chance it runs
with the 4.0 parsing code.

HTH,
Jörn
