Dear Jörn,

I am working together with Nikolai on the project, and the problem is that we tried the alpha numeric optimisation, but it does not seem to change anything:

dominik@debian:~ $ opennlp TokenizerTrainer -lang fra -encoding UTF8 -model "fr-token.bin" -data corpus.wtok.txt
[…]
dominik@debian:~/annotated-corpora/tokenized$ opennlp TokenizerME BUILD/fra/fr-token.bin
Pontsmouth
Ponts mouth

dominik@debian:~ $ opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt isAlphaNumOpt -model "fr-token.bin" -data corpus.wtok.txt
[…]
dominik@debian:~$ opennlp TokenizerME fr-token.bin
Pontsmouth
Ponts mouth

Do you have any suggestions on how this could be fixed? We also checked whether the model needs to be run with a flag or something similar, but TokenizerME did not accept any such option.

Best regards,
Dominik

On 2019/02/06 19:17:30, Joern Kottmann <k...@gmail.com> wrote:
> Yes it is configurable. There is the so called alpha numeric
> optimisation, if this is set to true the tokenizer will not split
> between chars of the same category.
>
> Jörn
>
> On Thu, Jan 17, 2019 at 11:08 AM Nikolai Krot <ta...@gmail.com> wrote:
> >
> > Hallo OpenNLPists,
> >
> > We have trained a Word Tokenizer model for French on our own data and see
> > weird cases where splitting occurs in the middle of a word, like this
> >
> > Portsmouth --> Ports mouth
> >
> > This is a word from the testing corpus that is normal French text found on
> > the web, though the word itself is not in French.
> >
> > I wonder why the word tokenizer attempts to split *between* two alphabetic
> > characters? I can imagine where splitting in the middle of a word can
> > indeed be useful, like in case of proclitics and enclitics, but I would
> > like to perform the latter as an additional step, making the word tokenizer
> > target only punctuation marks. Is it somehow configurable in OpenNLP?
> >
> > Best regards,
> > Nikolai

Dominik Terweh
Intern, Drooms GmbH
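
To make the idea in Jörn's reply concrete, here is a minimal, self-contained Python sketch of what "not splitting between chars of the same category" could look like. It is a conceptual illustration only, not OpenNLP's implementation: `tokenize`, `toy_model`, and the ASCII-only `ALPHANUMERIC` pattern are all hypothetical names and assumptions introduced for this example, and `toy_model` merely imitates a trained model that wrongly splits before "mouth".

```python
import re

# Assumption: a chunk counts as "alphanumeric" if it contains only ASCII
# letters and digits. A real tokenizer may define the category differently.
ALPHANUMERIC = re.compile(r"^[A-Za-z0-9]+$")


def tokenize(text, should_split, alpha_num_opt=False):
    """Split on whitespace, then ask should_split(chunk, i) at each inner position."""
    tokens = []
    for chunk in text.split():
        # With the optimisation on, purely alphanumeric chunks are kept whole
        # and the split decision function is never consulted for them.
        if alpha_num_opt and ALPHANUMERIC.match(chunk):
            tokens.append(chunk)
            continue
        start = 0
        for i in range(1, len(chunk)):
            if should_split(chunk, i):
                tokens.append(chunk[start:i])
                start = i
        tokens.append(chunk[start:])
    return tokens


def toy_model(chunk, i):
    # Hypothetical stand-in for a trained model: it splits before "mouth"
    # (the unwanted behaviour) and before a comma or full stop.
    return chunk[i:].startswith("mouth") or chunk[i] in ",."


print(tokenize("Visit Portsmouth today", toy_model))
print(tokenize("Visit Portsmouth today", toy_model, alpha_num_opt=True))
```

In this sketch a chunk such as "Portsmouth," (with trailing punctuation) is not purely alphanumeric, so it would still be passed to the split decision function even with the option enabled.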