SentenceDetectorTrainer for Chinese and Japanese does not seem to recognize specific punctuation

Markus Jelsma Sun, 19 Apr 2020 08:51:21 -0700

Hello,

Chinese and Japanese use different punctuation characters, but passing them to 
the trainer tool (using -eosChars '。！？.!?') does not seem to do anything. The 
trained models have abysmal scores when evaluated with the 
SentenceDetectorEvaluator tool.


When I transform the 。 to . in the training data using sed, and then train, the 
models have acceptable scores.

I did notice that the eosChars value does not end up intact in the 
manifest.properties file. It becomes:
eosCharacters=\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD.\!?
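
A possible reading of that value, offered as an assumption and not a confirmed 
diagnosis: the three full-width characters take three bytes each in UTF-8, and 
nine replacement characters (U+FFFD) is exactly what results if those nine bytes 
are decoded with a charset that cannot represent them, for example when the 
JVM's default charset is not UTF-8. The small Java sketch below only reproduces 
that count. The class and method names are made up for illustration and are not 
part of OpenNLP.

```java
import java.nio.charset.StandardCharsets;

public class EosCharsEncodingDemo {

    // Encode the text as UTF-8, then decode the bytes as US-ASCII.
    // This mimics a hypothetical charset mismatch; it is not OpenNLP code.
    static String decodeUtf8BytesAsAscii(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.US_ASCII);
    }

    // Count how many U+FFFD replacement characters the string contains.
    static long countReplacementChars(String text) {
        return text.chars().filter(c -> c == 0xFFFD).count();
    }

    public static void main(String[] args) {
        // Ideographic full stop, full-width exclamation mark, full-width
        // question mark, followed by the ASCII ".!?". Escapes are used so the
        // source compiles regardless of the file encoding.
        String eosChars = "\u3002\uFF01\uFF1F.!?";
        String mangled = decodeUtf8BytesAsAscii(eosChars);

        System.out.println(countReplacementChars(mangled)); // prints 9
        System.out.println(mangled.endsWith(".!?"));        // prints true
    }
}
```

If this is what happens, the model never sees 。！？ as end-of-sentence 
characters, which would fit the poor evaluation scores.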

When I manually update the file to list 。！？.!?, nothing changes.
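
One thing that may matter for the manual edit, assuming the manifest is read 
with java.util.Properties.load(InputStream) (not verified against the OpenNLP 
source): that method reads the file as ISO-8859-1, so raw UTF-8 characters typed 
into a .properties file are not read back as the intended characters, while 
Unicode escapes are. The sketch below shows only that standard-library 
behaviour. The class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

public class ManifestEscapeDemo {

    // Write the given text as UTF-8 bytes (as a typical editor would) and read
    // the eosCharacters key back through Properties.load(InputStream).
    static String loadEosCharacters(String fileText) {
        Properties props = new Properties();
        byte[] bytes = fileText.getBytes(StandardCharsets.UTF_8);
        try {
            props.load(new ByteArrayInputStream(bytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty("eosCharacters");
    }

    public static void main(String[] args) {
        String expected = "\u3002\uFF01\uFF1F.!?";

        // Raw characters in the file: each UTF-8 byte is read as one
        // ISO-8859-1 character, so the value comes back mangled.
        String raw = loadEosCharacters("eosCharacters=" + expected + "\n");

        // Escaped form in the file: a literal backslash, the letter u and
        // four hex digits per character, which Properties decodes on load.
        String escaped = loadEosCharacters(
                "eosCharacters=\\u3002\\uFF01\\uFF1F.!?\n");

        System.out.println(raw.equals(expected));     // prints false
        System.out.println(escaped.equals(expected)); // prints true
    }
}
```

Under that assumption, the manifest entry would need the escaped form of the 
three characters, not the characters themselves. Whether editing the manifest 
of an already trained model has any effect on its behaviour is a separate 
question.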

What am I doing wrong?

Many thanks,
Markus
