Hi Joern,

This indeed solves the issue. Thank you very much for the heads up!
Best regards,
Nikolai

On Tue, Feb 19, 2019 at 11:10 AM Joern Kottmann <kottm...@gmail.com> wrote:

> I think the usage information for the tokenizer trainer is misleading.
>
> You need to do it like this:
>
> opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt true -model "fr-token.bin" -data corpus.wtok.txt
>
> Jörn
>
> On Mon, Feb 11, 2019 at 2:15 PM Dominik Terweh <d.ter...@drooms.com> wrote:
>
> > Dear Jörn,
> >
> > I am working together with Nikolai on the project, and the problem is
> > that we tried the alphanumeric optimisation, but it does not seem to
> > change anything:
> >
> > dominik@debian:~$ opennlp TokenizerTrainer -lang fra -encoding UTF8 -model "fr-token.bin" -data corpus.wtok.txt
> > […]
> > dominik@debian:~/annotated-corpora/tokenized$ opennlp TokenizerME BUILD/fra/fr-token.bin
> > Pontsmouth
> > Ponts mouth
> >
> > dominik@debian:~$ opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt isAlphaNumOpt -model "fr-token.bin" -data corpus.wtok.txt
> > […]
> > dominik@debian:~$ opennlp TokenizerME fr-token.bin
> > Pontsmouth
> > Ponts mouth
> >
> > Do you have any suggestions for how this could be fixed?
> >
> > We also checked whether the model needs to be run with a flag or
> > something, but TokenizerME did not accept any flag or anything similar.
> >
> > Best regards,
> > Dominik
> >
> > On 2019/02/06 19:17:30, Joern Kottmann <k...@gmail.com> wrote:
> >
> > > Yes, it is configurable. There is the so-called alphanumeric
> > > optimisation; if it is set to true, the tokenizer will not split
> > > between characters of the same category.
> > >
> > > Jörn
> > >
> > > On Thu, Jan 17, 2019 at 11:08 AM Nikolai Krot <ta...@gmail.com> wrote:
> > >
> > > > Hallo OpenNLPists,
> > > >
> > > > We have trained a word tokenizer model for French on our own data
> > > > and see weird cases where splitting occurs in the middle of a word,
> > > > like this:
> > > >
> > > > Portsmouth --> Ports mouth
> > > >
> > > > This is a word from the testing corpus, which is normal French text
> > > > found on the web, though the word itself is not French.
> > > >
> > > > I wonder why the word tokenizer attempts to split *between* two
> > > > alphabetic characters. I can imagine cases where splitting in the
> > > > middle of a word can indeed be useful, as with proclitics and
> > > > enclitics, but I would like to perform the latter as an additional
> > > > step, making the word tokenizer target only punctuation marks. Is
> > > > this somehow configurable in OpenNLP?
> > > >
> > > > Best regards,
> > > > Nikolai
> >
> > Dominik Terweh
> > Praktikant
> >
> > Drooms GmbH
> > Eschersheimer Landstraße 6
> > 60322 Frankfurt, Germany
> > www.drooms.com
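
For anyone hitting the same problem from the Java API rather than the CLI: the alphanumeric optimisation is the third constructor argument of TokenizerFactory, which is then passed to TokenizerME.train. Below is a minimal sketch of that, assuming opennlp-tools is on the classpath; the file name "corpus.wtok.txt" and the sample sentence are taken from the thread, everything else is illustrative.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainFrenchTokenizer {
    public static void main(String[] args) throws Exception {
        // Read the whitespace-tokenized training corpus line by line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("corpus.wtok.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Third argument is useAlphaNumericOptimization: when true, the
        // trained tokenizer will not propose a split between two
        // characters of the same (alphanumeric) category.
        TokenizerFactory factory = new TokenizerFactory("fra", null, true, null);

        TokenizerModel model = TokenizerME.train(
                samples, factory, TrainingParameters.defaultParams());

        TokenizerME tokenizer = new TokenizerME(model);
        // With the optimisation enabled, "Portsmouth" should stay one token.
        String[] tokens = tokenizer.tokenize("Portsmouth est une ville.");
        for (String t : tokens) {
            System.out.println(t);
        }
    }
}
```

This does programmatically what the corrected CLI call (`-alphaNumOpt true`) does; the model can then be serialized with `model.serialize(...)` to get the same `fr-token.bin` artifact.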