On 03/14/2013 11:27 AM, Andreas Niekler wrote:
Hello,
We probably need to fix the detokenizer rules used for the German models
a bit to handle these cases correctly.
Are those rules public somewhere so that I can edit them myself? I can
provide them to the community afterwards. Mostly, characters like „ and “ are
not recognized by the tokenizer. I don't want to convert them before
tokenizing, because we analyze things like direct speech and those
characters are a good indicator for that.
No, for the German models I wrote some code to do the detokenization,
tailored to a specific corpus. Anyway, this work then led me to contribute the
detokenizer to OpenNLP.
There is one file for English:
https://github.com/apache/opennlp/tree/trunk/opennlp-tools/lang/en/tokenizer
We would be happy to receive a contribution for German.
Have a look at the documentation; there is a section about the detokenizer.
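For the German quotes you mention, the entries would presumably look
something like the sketch below, modeled on the entries in the English
file (please double-check the element and operation names against that
file and the documentation): „ should merge with the token to its right
and “ with the token to its left.

  <?xml version="1.0" encoding="UTF-8"?>
  <dictionary>
    <!-- German opening quote attaches to the following token, e.g. „Hallo -->
    <entry operation="MOVE_RIGHT">
      <token>„</token>
    </entry>
    <!-- German closing quote attaches to the preceding token, e.g. Welt“ -->
    <entry operation="MOVE_LEFT">
      <token>“</token>
    </entry>
  </dictionary>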
I suggest using our detokenizer to turn your tokenized text into
training data.
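As a rough sketch of how that could look in code (assuming a German rule
file named de-detokenizer.xml; the exact method signatures are best
checked against the Javadoc), one tokenized sentence can be turned back
into plain text like this:

  import java.io.FileInputStream;
  import java.io.InputStream;

  import opennlp.tools.tokenize.DetokenizationDictionary;
  import opennlp.tools.tokenize.Detokenizer;
  import opennlp.tools.tokenize.DictionaryDetokenizer;
  import opennlp.tools.tokenize.WhitespaceTokenizer;

  public class DetokenizeSample {
    public static void main(String[] args) throws Exception {
      // Load the (hypothetical) German rule file.
      InputStream in = new FileInputStream("de-detokenizer.xml");
      Detokenizer detokenizer =
          new DictionaryDetokenizer(new DetokenizationDictionary(in));
      in.close();

      // One already tokenized sentence, tokens separated by whitespace.
      String[] tokens = WhitespaceTokenizer.INSTANCE
          .tokenize("Er sagte : „ Hallo Welt ! “");

      // Merge the tokens back into running text (no extra split marker).
      System.out.println(detokenizer.detokenize(tokens, null));
    }
  }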
Does the detokenizer have a command line tool as well?
Yes, there is one. Have a look at the CLI help.
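For reference, and assuming the rule file is named de-detokenizer.xml,
the call should look roughly like this (the exact tool name and
arguments are shown by the CLI help):

  bin/opennlp DictionaryDetokenizer de-detokenizer.xml < corpus-tokenized.txt > corpus-text.txt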
Jörn