Hello,

For anyone stumbling on this issue as well, I worked around the problem by transforming the characters to their Latin-script counterparts prior to training and detection. It is not ideal, but it works. A rough sketch of the detection-side mapping is below.
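Roughly, something like this, where "zh-sent.bin" is a placeholder for a model trained on data normalized with the same mapping, and the character set is just the three end-of-sentence marks I cared about:

    import java.io.FileInputStream;
    import java.io.IOException;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class CjkPunctuationWorkaround {

        // Map common CJK end-of-sentence punctuation to Latin equivalents.
        static String toLatinEos(String text) {
            return text
                .replace('\u3002', '.')   // 。 ideographic full stop
                .replace('\uFF01', '!')   // ！ fullwidth exclamation mark
                .replace('\uFF1F', '?');  // ？ fullwidth question mark
        }

        public static void main(String[] args) throws IOException {
            // The model must have been trained on text run through the
            // same toLatinEos normalization.
            try (FileInputStream in = new FileInputStream("zh-sent.bin")) {
                SentenceDetectorME detector =
                    new SentenceDetectorME(new SentenceModel(in));
                String normalized = toLatinEos("你好。你好吗？");
                for (String sentence : detector.sentDetect(normalized)) {
                    System.out.println(sentence);
                }
            }
        }
    }

The same mapping has to be applied to the training data (sed works fine for that), otherwise the model and the input disagree on what a sentence boundary looks like.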
Regards,
Markus

-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Sunday 19th April 2020 17:50
> To: users@opennlp.apache.org
> Subject: SentenceDetectorTrainer for Chinese and Japanese does not seem to
> recognize specific punctuation
>
> Hello,
>
> Chinese and Japanese use different punctuation characters, but passing them
> to the trainer tool (using -eosChars '。！？.!?') does not seem to do anything;
> the trained models have abysmal scores when checked with the
> SentenceDetectorEvaluator tool.
>
> When I transform 。 to . in the training data using sed and then train,
> the models have acceptable scores.
>
> I did notice the eosChars do not seem to end up well in the
> manifest.properties file; it becomes:
> eosCharacters=\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD.\!?
>
> When I manually update the file to list 。！？.!?, nothing changes.
>
> What am I doing wrong?
>
> Many thanks,
> Markus