We probably need to fix the detokenizer rules used for the German models
a bit to handle these cases correctly.

To train the tokenizer you either need proper training data, with whitespace and
<SPLIT> tags, or you take already tokenized data and convert it into training
data with a rule-based detokenizer.
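
For reference, the training format is one sentence per line, with a <SPLIT>
tag at every token boundary that is not marked by whitespace in the text,
for example (made-up sentence):

Er sagte<SPLIT>, dass "<SPLIT>die neue Regelung gilt<SPLIT>"<SPLIT>.

The <SPLIT> positions are what tell the trainer where a split has to happen
inside a whitespace-separated chunk.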

The current implementation can't train the tokenizer on whitespace-separated
tokens alone, because that does not generate proper training data for the
maxent trainer: without any <SPLIT> tags every candidate split point gets the
same outcome, which is why you only got one feature. Training with only <SPLIT>
tags works, but is apparently not really compatible with our feature
generation, which was not designed for that case.

I suggest using our detokenizer to turn your tokenized text into training data.
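
Something along these lines should do it (an untested sketch; it assumes you
have a detokenizer dictionary such as the latin-detokenizer.xml from our
source tree, and one whitespace-tokenized sentence per line as input — if I
remember correctly the Detokenizer interface has a
detokenize(String[] tokens, String splitMarker) method which inserts the
marker at every merge point):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.PrintWriter;

import opennlp.tools.tokenize.DetokenizationDictionary;
import opennlp.tools.tokenize.Detokenizer;
import opennlp.tools.tokenize.DictionaryDetokenizer;

public class Tokens2TokenizerTrainingData {

  public static void main(String[] args) throws Exception {
    // args[0] = detokenizer dictionary xml (e.g. latin-detokenizer.xml)
    // args[1] = input, one whitespace-tokenized sentence per line
    // args[2] = output in the <SPLIT> training format
    Detokenizer detokenizer = new DictionaryDetokenizer(
        new DetokenizationDictionary(new FileInputStream(args[0])));

    BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(args[1]), "UTF-8"));
    PrintWriter out = new PrintWriter(args[2], "UTF-8");

    String line;
    while ((line = in.readLine()) != null) {
      if (line.trim().length() == 0) {
        continue;
      }

      String[] tokens = line.split("\\s+");

      // re-attaches punctuation, quotes, etc. and inserts <SPLIT>
      // at every merge point, which is the trainer input format
      out.println(detokenizer.detokenize(tokens, "<SPLIT>"));
    }

    out.close();
    in.close();
  }
}

The output should then be usable directly as TokenizerTrainer input. If the
rules don't handle the German quotes correctly you will have to extend the
dictionary a bit, which brings us back to the first point above.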

Jörn

On 03/14/2013 10:49 AM, Andreas Niekler wrote:
Hello,


> If you want to tokenize based on whitespace, I suggest using our
> whitespace tokenizer.
No, I do not want to tokenize on whitespace. I found out that the
de-token.bin model isn't capable of separating things like direct speech
in texts like Er sagte, dass "die neue. This ends with a token "die. So I
got a clean 300k-sentence sample from our German reference corpus, which
is in the form of whitespace-separated tokens, one sentence per line.
I fed this to the TokenizerTrainer tool and ended up with an
exception because only one feature was found. So I added all the <SPLIT>
tags like in the documentation and the training terminated without an
error, but the resulting model still makes the undesired errors. So I
surely need a model-based tokenizer, because I also want to split off
punctuation and so on. The only thing I wasn't able to do is train based
on whitespace-separated sentences.

Thanks for your help

Andreas

