Dear Jörn,

I am working with Nikolai on this project. We tried the alpha numeric
optimisation, but it does not seem to change anything:

dominik@debian:~ $ opennlp TokenizerTrainer -lang fra -encoding UTF8 -model "fr-token.bin" -data corpus.wtok.txt
[…]
dominik@debian:~/annotated-corpora/tokenized$ opennlp TokenizerME BUILD/fra/fr-token.bin
Pontsmouth
Ponts mouth
dominik@debian:~ $ opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt isAlphaNumOpt -model "fr-token.bin" -data corpus.wtok.txt
[…]
dominik@debian:~$ opennlp TokenizerME fr-token.bin
Pontsmouth
Ponts mouth

Do you have any suggestions on how this could be fixed?
We also checked whether the model needs to be run with a flag, but
TokenizerME did not accept any such option.
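
To make our expectation concrete, here is a minimal Python sketch of how we understand the optimisation from your description quoted below. It is an assumption on our side, not OpenNLP's actual code: the ASCII-only pattern, the `tokenize` function and the `toy_model` stand-in for a trained model are all hypothetical names chosen for illustration.

```python
import re

# Assumption: an ASCII-only "alphanumeric" pattern, chosen for illustration.
ALPHANUMERIC = re.compile(r"^[A-Za-z0-9]+$")


def toy_model(chunk):
    """Hypothetical stand-in for a trained model: a deliberately bad one
    that splits every chunk longer than five characters after the fifth."""
    if len(chunk) > 5:
        return [chunk[:5], chunk[5:]]
    return [chunk]


def tokenize(text, split_inside, alpha_num_opt):
    """Split on whitespace, then let `split_inside` subdivide each chunk,
    unless the optimisation is on and the chunk is purely alphanumeric."""
    tokens = []
    for chunk in text.split():
        if alpha_num_opt and ALPHANUMERIC.match(chunk):
            # Same character category throughout: keep the chunk whole.
            tokens.append(chunk)
        else:
            tokens.extend(split_inside(chunk))
    return tokens


without_opt = tokenize("Portsmouth", toy_model, alpha_num_opt=False)
with_opt = tokenize("Portsmouth", toy_model, alpha_num_opt=True)
print(without_opt)
print(with_opt)
```

With the optimisation switched on we would therefore expect the input to come back unsplit, which is not what we observe.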

Best regards,
Dominik

On 2019/02/06 19:17:30, Joern Kottmann <k...@gmail.com> wrote:
> Yes it is configurable. There is the so called alpha numeric
> optimisation, if this is set to true the tokenizer will not split
> between chars of the same category.
>
> Jörn
>
> On Thu, Jan 17, 2019 at 11:08 AM Nikolai Krot <ta...@gmail.com> wrote:
> >
> > Hello OpenNLPists,
> >
> > We have trained a Word Tokenizer model for French on our own data and see
> > weird cases where splitting occurs in the middle of a word, like this
> >
> > Portsmouth --> Ports mouth
> >
> > This is a word from the testing corpus that is normal French text found on
> > the web, though the word itself is not in French.
> >
> > I wonder why the word tokenizer attempts to split *between* two alphabetic
> > characters? I can imagine where splitting in the middle of a word can
> > indeed be useful, like in case of proclitics and enclitics, but I would
> > like to perform the latter as an additional step, making the word tokenizer
> > target only punctuation marks. Is it somehow configurable in OpenNLP?
> >
> > Best regards,
> > Nikolai
>

Dominik Terweh
Intern

Drooms GmbH
Eschersheimer Landstraße 6
60322 Frankfurt, Germany
www.drooms.com

Mail:   d.ter...@drooms.com


