We probably need to fix the detokenizer rules used for the German models
a bit to handle these cases correctly.

To train the tokenizer you either need proper training data, with whitespace and
<SPLIT> tags, or you take already tokenized data and convert it into training
data with a rule-based detokenizer.
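
For reference, the training format is one sentence per line, with a <SPLIT>
tag at every token boundary that is not marked by whitespace in the text,
for example (made-up sentence):

Er sagte<SPLIT>, dass "<SPLIT>die neue Regelung gilt<SPLIT>"<SPLIT>.

The <SPLIT> positions are what tell the trainer where a split has to happen
inside a whitespace-separated chunk.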

The current implementation can't train the tokenizer on whitespace-separated
tokens alone, because that does not generate proper training data for the
maxent trainer: without any <SPLIT> tags every candidate split point gets the
same outcome, which is why you only got one feature. Training with only <SPLIT>
tags works, but is apparently not really compatible with our feature
generation, which was not designed for that case.

I suggest using our detokenizer to turn your tokenized text into training data.
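
Something along these lines should do it (an untested sketch; it assumes you
have a detokenizer dictionary such as the latin-detokenizer.xml from our
source tree, and one whitespace-tokenized sentence per line as input — if I
remember correctly the Detokenizer interface has a
detokenize(String[] tokens, String splitMarker) method which inserts the
marker at every merge point):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class Tokens2TokenizerTrainingData {

  public static void main(String[] args) throws Exception {
    // args[0] = detokenizer dictionary xml (e.g. latin-detokenizer.xml)
    // args[1] = input, one whitespace-tokenized sentence per line
    // args[2] = output in the <SPLIT> training format
    Detokenizer detokenizer = new DictionaryDetokenizer(
        new DetokenizationDictionary(new FileInputStream(args[0])));

    BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(args[1]), "UTF-8"));
    PrintWriter out = new PrintWriter(args[2], "UTF-8");

    String line;
    while ((line = in.readLine()) != null) {
      if (line.trim().length() == 0) {
        continue;
      }

      String[] tokens = line.split("\\s+");

      // re-attaches punctuation, quotes, etc. and inserts <SPLIT>
      // at every merge point, which is the trainer input format
      out.println(detokenizer.detokenize(tokens, "<SPLIT>"));
    }

    out.close();
    in.close();
  }
}

The output should then be usable directly as TokenizerTrainer input. If the
rules don't handle the German quotes correctly you will have to extend the
dictionary a bit, which brings us back to the first point above.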

Jörn

On 03/14/2013 10:49 AM, Andreas Niekler wrote:
Hello,


> If you want to tokenize based on whitespace, I suggest using our
> whitespace tokenizer.
No, I do not want to tokenize on whitespace. I found out that the
de-token.bin model isn't capable of separating things like direct speech
in texts like Er sagte, dass "die neue. This ends with a token "die. So I
got a clean 300k-sentence sample from our German reference corpus, which
is in the form of whitespace-separated tokens, one sentence per line.
I fed this to the TokenizerTrainer tool and ended up with an
exception because only one feature was found. So I added all the <SPLIT>
tags like in the documentation and the training terminated without an
error, but the resulting model still makes the undesired errors. So I
surely need a model-based tokenizer, because I also want to split off
punctuation and so on. The only thing I wasn't able to do is train based
on whitespace-separated sentences.

Thanks for your help

Andreas

