On 04/25/2013 03:15 PM, Svetoslav Marinov wrote:
What corpus has the English Coref module been trained on?
I contributed code to train on MUC data, but there are still a few
problems with detecting possible mentions in the training data.
If you want to give that a try, I can help you get started.
As far as I know, the coref models have been trained on MUC data plus
some private data, but I am not sure that is correct.
Can someone provide some guidance on which language-specific resources
(modulo sentence splitters, tokenizers, POS taggers, parsers and NER) are
needed in order to get coreference working for a new language? A
WordNet? What else?
Input needs to be:
- Sentence split
- Tokenized
- Either a full or shallow parse, depending on how you trained the coref model
(see the sketch below)
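Roughly, consuming that input looks like the following. This is only a
minimal sketch assuming the 1.5.x-era coref classes (TreebankLinker,
DefaultParse, etc.); the model directory is a hypothetical path and the
exact class names may differ in your version:

  import java.util.ArrayList;
  import java.util.List;

  import opennlp.tools.coref.DiscourseEntity;
  import opennlp.tools.coref.Linker;
  import opennlp.tools.coref.LinkerMode;
  import opennlp.tools.coref.TreebankLinker;
  import opennlp.tools.coref.mention.DefaultParse;
  import opennlp.tools.coref.mention.Mention;
  import opennlp.tools.parser.Parse;

  public class CorefSketch {

    // parses: one treebank-style parse per sentence, i.e. text that has
    // already been sentence split, tokenized and parsed
    static DiscourseEntity[] link(Parse[] parses, String corefModelDir)
        throws Exception {
      // corefModelDir is a hypothetical path to the coref model directory
      Linker linker = new TreebankLinker(corefModelDir, LinkerMode.TEST);

      List<Mention> mentions = new ArrayList<Mention>();
      for (int i = 0; i < parses.length; i++) {
        // wrap each sentence parse so the mention finder can walk it
        DefaultParse wrapped = new DefaultParse(parses[i], i);
        Mention[] extents = linker.getMentionFinder().getMentions(wrapped);
        for (Mention extent : extents) {
          mentions.add(extent);
        }
      }

      // resolve the collected mentions into discourse entities
      return linker.getEntities(mentions.toArray(new Mention[mentions.size()]));
    }
  }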
If you don't have a WordNet dictionary for your language, you can probably
disable the part of feature generation which uses it. I don't know how
that will affect the performance.
We will move the coref component to the sandbox for the next release and
hopefully get some help
to refactor it so it can be moved back to the tools package.
Having a second coref component, e.g. a rule-based one, would also be nice.
Jörn