Hello,

For anyone stumbling on this issue as well, I worked around the problem by transforming the characters to their Latin-script counterparts prior to training and detection. It is not ideal, but it works. A rough sketch of the detection-side mapping is below.
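Roughly, something like this, where "zh-sent.bin" is a placeholder for a model trained on data normalized with the same mapping, and the character set is just the three end-of-sentence marks I cared about:

    import java.io.FileInputStream;
    import java.io.IOException;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class CjkPunctuationWorkaround {

        // Map common CJK end-of-sentence punctuation to Latin equivalents.
        static String toLatinEos(String text) {
            return text
                .replace('\u3002', '.')   // 。 ideographic full stop
                .replace('\uFF01', '!')   // ！ fullwidth exclamation mark
                .replace('\uFF1F', '?');  // ？ fullwidth question mark
        }

        public static void main(String[] args) throws IOException {
            // The model must have been trained on text run through the
            // same toLatinEos normalization.
            try (FileInputStream in = new FileInputStream("zh-sent.bin")) {
                SentenceDetectorME detector =
                    new SentenceDetectorME(new SentenceModel(in));
                String normalized = toLatinEos("你好。你好吗？");
                for (String sentence : detector.sentDetect(normalized)) {
                    System.out.println(sentence);
                }
            }
        }
    }

The same mapping has to be applied to the training data (sed works fine for that), otherwise the model and the input disagree on what a sentence boundary looks like.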
Regards,
Markus

-----Original message-----
> From: Markus Jelsma <markus.jel...@openindex.io>
> Sent: Sunday 19th April 2020 17:50
> To: users@opennlp.apache.org
> Subject: SentenceDetectorTrainer for Chinese and Japanese does not seem to
> recognize specific punctuation
>
> Hello,
>
> Chinese and Japanese use different punctuation characters, but passing them
> to the trainer tool (using -eosChars '。！？.!?') does not seem to do anything;
> the trained models have abysmal scores when checked with the
> SentenceDetectorEvaluator tool.
>
> When I transform 。 to . in the training data using sed and then train,
> the models have acceptable scores.
>
> I did notice the eosChars do not seem to end up well in the
> manifest.properties file; it becomes:
> eosCharacters=\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD\uFFFD.\!?
>
> When I manually update the file to list 。！？.!?, nothing changes.
>
> What am I doing wrong?
>
> Many thanks,
> Markus