I think the usage information for the tokenizer trainer is misleading. You need to do it like this:

opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt true -model "fr-token.bin" -data corpus.wtok.txt
Jörn

On Mon, Feb 11, 2019 at 2:15 PM Dominik Terweh <d.ter...@drooms.com> wrote:
> Dear Jörn,
>
> I am working together with Nikolai on the project, and the problem is that
> we tried the alphanumeric optimisation, but it does not seem to change
> anything:
>
> dominik@debian:~$ opennlp TokenizerTrainer -lang fra -encoding UTF8 -model "fr-token.bin" -data corpus.wtok.txt
> […]
> dominik@debian:~/annotated-corpora/tokenized$ opennlp TokenizerME BUILD/fra/fr-token.bin
> Pontsmouth
> Ponts mouth
>
> dominik@debian:~$ opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt isAlphaNumOpt -model "fr-token.bin" -data corpus.wtok.txt
> […]
> dominik@debian:~$ opennlp TokenizerME fr-token.bin
> Pontsmouth
> Ponts mouth
>
> Do you have any suggestions how this could be fixed?
> We also tried whether the model needs to be run with a flag or something,
> but TokenizerME did not accept any flag or anything similar.
>
> Best regards,
> Dominik
>
> On 2019/02/06 19:17:30, Joern Kottmann <k...@gmail.com> wrote:
> > Yes, it is configurable. There is the so-called alphanumeric
> > optimisation; if this is set to true, the tokenizer will not split
> > between chars of the same category.
> >
> > Jörn
> >
> > On Thu, Jan 17, 2019 at 11:08 AM Nikolai Krot <ta...@gmail.com> wrote:
> > >
> > > Hallo OpenNLPists,
> > >
> > > We have trained a Word Tokenizer model for French on our own data and
> > > see weird cases where splitting occurs in the middle of a word, like
> > > this:
> > >
> > > Portsmouth --> Ports mouth
> > >
> > > This is a word from the testing corpus, which is normal French text
> > > found on the web, though the word itself is not French.
> > >
> > > I wonder why the word tokenizer attempts to split *between* two
> > > alphabetic characters?
> > > I can imagine cases where splitting in the middle of a word can
> > > indeed be useful, as with proclitics and enclitics, but I would
> > > like to perform the latter as an additional step, making the word
> > > tokenizer target only punctuation marks. Is this somehow configurable
> > > in OpenNLP?
> > >
> > > Best regards,
> > > Nikolai
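For anyone finding this thread later, here is a toy sketch (plain Python, not OpenNLP code; the function name is mine) of the idea behind the alphanumeric optimisation: with it enabled, positions between two characters of the same alphanumeric category are never even offered to the model as split candidates, so "Portsmouth" cannot come out as "Ports mouth", while splits around punctuation such as the French clitic apostrophe remain possible.

```python
def candidate_splits(text, alpha_numeric_optimization=True):
    """Return the inter-character positions a tokenizer model would be
    allowed to consider as token boundaries (toy illustration only).

    With the optimisation on, a position between two alphanumeric
    characters is skipped, so splits inside words like "Portsmouth"
    are impossible by construction.
    """
    positions = []
    for i in range(1, len(text)):
        if alpha_numeric_optimization and text[i - 1].isalnum() and text[i].isalnum():
            continue  # never split inside an alphanumeric run
        positions.append(i)
    return positions

# With the optimisation on, "Portsmouth" has no interior candidates at all,
# so no trained model can produce "Ports mouth":
print(candidate_splits("Portsmouth"))         # → []
# With it off, every interior position is a candidate the model must score:
print(candidate_splits("Portsmouth", False))  # → [1, 2, 3, 4, 5, 6, 7, 8, 9]
# Clitic boundaries around punctuation are still available either way:
print(candidate_splits("l'avion"))            # → [1, 2]
```

This also illustrates why the flag has to be set at training time (as in the TokenizerTrainer command above) rather than passed to TokenizerME at run time: it changes which events the model is trained on, so it is baked into the model file.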