On 07/18/2012 04:30 AM, Lance Norskog wrote:
Please use unencumbered training data for all future OpenNLP projects.

We would of course like to do that, but it is not that easy.
For coreference there is no good data set that is available
under an Open Source license.

The only way to *fix* that is to produce your own training
data based on a text source which can be shared under an
open source license.

We have started to build tooling to crowd-source such annotations,
but a lot of work remains to finish it. Any help in this area is very welcome.

What exactly does a coref training dataset have to include? What kind
of tagging or cross-referencing?

- Full or shallow parse
- Named Entities
- Linked mentions
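To make the last point a bit more concrete, linked mentions are typically
marked up roughly like this (a made-up sentence in MUC-style notation, shown
only as an illustration and not necessarily the format proposed in the thread
below); mentions chained together via ID/REF refer to the same entity:

  <COREF ID="1">Pierre Vinken</COREF> joined the board last year.
  <COREF ID="2" REF="1">He</COREF> will also chair <COREF ID="3">the new committee</COREF>.

In addition to the linked mentions, each sentence needs its (full or shallow)
parse and the named entity spans.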

Have a look at this thread:
http://mail-archives.apache.org/mod_mbox/opennlp-dev/201203.mbox/%[email protected]%3E

I proposed the new format in that thread and implemented it afterwards.

For OntoNotes we need to do some adaptation to get the data into a form
that can be used for training, e.g. filtering out verb mentions, doing the parsing, etc.
If we get the model to train well on this dataset it would be a good step forward.
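As a rough illustration of the verb mention filtering, here is a minimal
sketch, assuming the data is in the CoNLL shared task column format in which
OntoNotes is commonly distributed (a single-token mention shows up as "(id)"
in the coreference column); the column indices are assumptions and have to
match the concrete export:

  import java.util.regex.Pattern;

  /** Sketch only: drops single-token verb mentions from a CoNLL-style coreference column. */
  public class VerbMentionFilter {

      private static final Pattern SINGLE_TOKEN_MENTION = Pattern.compile("\\(\\d+\\)\\|?");

      /**
       * Rewrites one token line. If the token is tagged as a verb (POS starts with "VB"),
       * any single-token mention "(id)" is removed from the coreference column.
       */
      static String filterLine(String line, int posColumn, int corefColumn) {
          if (line.isEmpty() || line.startsWith("#")) {
              return line; // keep comments and document boundary lines untouched
          }
          String[] cols = line.split("\\s+");
          if (cols[posColumn].startsWith("VB")) {
              String coref = SINGLE_TOKEN_MENTION.matcher(cols[corefColumn]).replaceAll("");
              if (coref.isEmpty()) {
                  coref = "-";                                    // no mention left on this token
              } else if (coref.endsWith("|")) {
                  coref = coref.substring(0, coref.length() - 1); // drop trailing separator
              }
              cols[corefColumn] = coref;
          }
          StringBuilder out = new StringBuilder();
          for (int i = 0; i < cols.length; i++) {
              if (i > 0) out.append('\t');
              out.append(cols[i]);
          }
          return out.toString();
      }
  }

The parsing mentioned above would still have to be added in a separate step.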

Jörn
