I think the usage information for the tokenizer trainer is misleading. You need to do it like this:

opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt true -model "fr-token.bin" -data corpus.wtok.txt
Jörn

On Mon, Feb 11, 2019 at 2:15 PM Dominik Terweh <d.ter...@drooms.com> wrote:
> Dear Jörn,
>
> I am working together with Nikolai on the project, and the problem is that
> we tried the alphanumeric optimisation, but it does not seem to change
> anything:
>
> dominik@debian:~$ opennlp TokenizerTrainer -lang fra -encoding UTF8 -model "fr-token.bin" -data corpus.wtok.txt
> […]
> dominik@debian:~/annotated-corpora/tokenized$ opennlp TokenizerME BUILD/fra/fr-token.bin
> Pontsmouth
> Ponts mouth
>
> dominik@debian:~$ opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt isAlphaNumOpt -model "fr-token.bin" -data corpus.wtok.txt
> […]
> dominik@debian:~$ opennlp TokenizerME fr-token.bin
> Pontsmouth
> Ponts mouth
>
> Do you have any suggestions how this could be fixed?
> We also tried whether the model needs to be run with a flag or something,
> but TokenizerME did not accept any flag or anything similar.
>
> Best regards,
> Dominik
>
> On 2019/02/06 19:17:30, Joern Kottmann <k...@gmail.com> wrote:
> > Yes, it is configurable. There is the so-called alphanumeric
> > optimisation; if this is set to true, the tokenizer will not split
> > between chars of the same category.
> >
> > Jörn
> >
> > On Thu, Jan 17, 2019 at 11:08 AM Nikolai Krot <ta...@gmail.com> wrote:
> > >
> > > Hallo OpenNLPists,
> > >
> > > We have trained a Word Tokenizer model for French on our own data and
> > > see weird cases where splitting occurs in the middle of a word, like
> > > this:
> > >
> > > Portsmouth --> Ports mouth
> > >
> > > This is a word from the testing corpus, which is normal French text
> > > found on the web, though the word itself is not French.
> > >
> > > I wonder why the word tokenizer attempts to split *between* two
> > > alphabetic characters?
> > > I can imagine cases where splitting in the middle of a word can
> > > indeed be useful, as with proclitics and enclitics, but I would
> > > like to perform the latter as an additional step, making the word
> > > tokenizer target only punctuation marks. Is this somehow configurable
> > > in OpenNLP?
> > >
> > > Best regards,
> > > Nikolai
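For anyone finding this thread later, here is a toy sketch (plain Python, not OpenNLP code; the function name is mine) of the idea behind the alphanumeric optimisation: with it enabled, positions between two characters of the same alphanumeric category are never even offered to the model as split candidates, so "Portsmouth" cannot come out as "Ports mouth", while splits around punctuation such as the French clitic apostrophe remain possible.

```python
def candidate_splits(text, alpha_numeric_optimization=True):
    """Return the inter-character positions a tokenizer model would be
    allowed to consider as token boundaries (toy illustration only).

    With the optimisation on, a position between two alphanumeric
    characters is skipped, so splits inside words like "Portsmouth"
    are impossible by construction.
    """
    positions = []
    for i in range(1, len(text)):
        if alpha_numeric_optimization and text[i - 1].isalnum() and text[i].isalnum():
            continue  # never split inside an alphanumeric run
        positions.append(i)
    return positions

# With the optimisation on, "Portsmouth" has no interior candidates at all,
# so no trained model can produce "Ports mouth":
print(candidate_splits("Portsmouth"))         # → []
# With it off, every interior position is a candidate the model must score:
print(candidate_splits("Portsmouth", False))  # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Clitic boundaries around punctuation are still available either way:
print(candidate_splits("l'avion"))            # → [1, 2]
```

This also illustrates why the flag has to be set at training time (as in the TokenizerTrainer command above) rather than passed to TokenizerME at run time: it changes which events the model is trained on, so it is baked into the model file.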