On 03/14/2013 11:27 AM, Andreas Niekler wrote:
Hello,
We probably need to fix the detokenizer rules used for the German models
a bit to handle these cases correctly.
Are those rules public somewhere so that I can edit them myself? I can
provide them to the community afterwards. Mostly, characters like „ and “ are
not recognized by the tokenizer. I don't want to convert them before
tokenizing, because we analyze things like direct speech and those
characters are a good indicator for that.
No, for the German models I wrote some code to do the detokenization,
tailored to a specific corpus. Anyway, this work then led me to contribute the
detokenizer to OpenNLP.
There is one file for English:
https://github.com/apache/opennlp/tree/trunk/opennlp-tools/lang/en/tokenizer
We would be happy to receive a contribution for German.
Have a look at the documentation; there is a section about the detokenizer.
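For the German quotes you mention, the entries would presumably look
something like the sketch below, modeled on the entries in the English
file (please double-check the element and operation names against that
file and the documentation): „ should merge with the token to its right
and “ with the token to its left.

  <?xml version="1.0" encoding="UTF-8"?>
  <dictionary>
    <!-- German opening quote attaches to the following token, e.g. „Hallo -->
    <entry operation="MOVE_RIGHT">
      <token>„</token>
    </entry>
    <!-- German closing quote attaches to the preceding token, e.g. Welt“ -->
    <entry operation="MOVE_LEFT">
      <token>“</token>
    </entry>
  </dictionary>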
I suggest using our detokenizer to turn your tokenized text into
training data.
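As a rough sketch of how that could look in code (assuming a German rule
file named de-detokenizer.xml; the exact method signatures are best
checked against the Javadoc), one tokenized sentence can be turned back
into plain text like this:

  import java.io.FileInputStream;
  import java.io.InputStream;

  import opennlp.tools.tokenize.DetokenizationDictionary;
  import opennlp.tools.tokenize.Detokenizer;
  import opennlp.tools.tokenize.DictionaryDetokenizer;
  import opennlp.tools.tokenize.WhitespaceTokenizer;

  public class DetokenizeSample {
    public static void main(String[] args) throws Exception {
      // Load the (hypothetical) German rule file.
      InputStream in = new FileInputStream("de-detokenizer.xml");
      Detokenizer detokenizer =
          new DictionaryDetokenizer(new DetokenizationDictionary(in));
      in.close();

      // One already tokenized sentence, tokens separated by whitespace.
      String[] tokens = WhitespaceTokenizer.INSTANCE
          .tokenize("Er sagte : „ Hallo Welt ! “");

      // Merge the tokens back into running text (no extra split marker).
      System.out.println(detokenizer.detokenize(tokens, null));
    }
  }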
Does the detokenizer have a command line tool as well?
Yes, there is one. Have a look at the CLI help.
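For reference, and assuming the rule file is named de-detokenizer.xml,
the call should look roughly like this (the exact tool name and
arguments are shown by the CLI help):

  bin/opennlp DictionaryDetokenizer de-detokenizer.xml < corpus-tokenized.txt > corpus-text.txt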
Jörn