Hi Joern,

This indeed solves the issue. Thank you very much for the heads up!
Best regards,
Nikolai

On Tue, Feb 19, 2019 at 11:10 AM Joern Kottmann <kottm...@gmail.com> wrote:

> I think the usage information for the tokenizer trainer is misleading.
>
> You need to do it like this:
>
> opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt true -model "fr-token.bin" -data corpus.wtok.txt
>
> Jörn
>
> On Mon, Feb 11, 2019 at 2:15 PM Dominik Terweh <d.ter...@drooms.com> wrote:
>
> > Dear Jörn,
> >
> > I am working together with Nikolai on the project, and the problem is
> > that we tried the alphanumeric optimisation, but it does not seem to
> > change anything:
> >
> > dominik@debian:~$ opennlp TokenizerTrainer -lang fra -encoding UTF8 -model "fr-token.bin" -data corpus.wtok.txt
> > […]
> > dominik@debian:~/annotated-corpora/tokenized$ opennlp TokenizerME BUILD/fra/fr-token.bin
> > Pontsmouth
> > Ponts mouth
> >
> > dominik@debian:~$ opennlp TokenizerTrainer -lang fra -encoding UTF8 -alphaNumOpt isAlphaNumOpt -model "fr-token.bin" -data corpus.wtok.txt
> > […]
> > dominik@debian:~$ opennlp TokenizerME fr-token.bin
> > Pontsmouth
> > Ponts mouth
> >
> > Do you have any suggestions for how this could be fixed?
> >
> > We also checked whether the model needs to be run with a flag or
> > something, but TokenizerME did not accept any flag or anything similar.
> >
> > Best regards,
> > Dominik
> >
> > On 2019/02/06 19:17:30, Joern Kottmann <k...@gmail.com> wrote:
> >
> > > Yes, it is configurable. There is the so-called alphanumeric
> > > optimisation; if it is set to true, the tokenizer will not split
> > > between characters of the same category.
> > >
> > > Jörn
> > >
> > > On Thu, Jan 17, 2019 at 11:08 AM Nikolai Krot <ta...@gmail.com> wrote:
> > >
> > > > Hallo OpenNLPists,
> > > >
> > > > We have trained a word tokenizer model for French on our own data
> > > > and see weird cases where splitting occurs in the middle of a word,
> > > > like this:
> > > >
> > > > Portsmouth --> Ports mouth
> > > >
> > > > This is a word from the testing corpus, which is normal French text
> > > > found on the web, though the word itself is not French.
> > > >
> > > > I wonder why the word tokenizer attempts to split *between* two
> > > > alphabetic characters. I can imagine cases where splitting in the
> > > > middle of a word can indeed be useful, as with proclitics and
> > > > enclitics, but I would like to perform the latter as an additional
> > > > step, making the word tokenizer target only punctuation marks. Is
> > > > this somehow configurable in OpenNLP?
> > > >
> > > > Best regards,
> > > > Nikolai
> >
> > Dominik Terweh
> > Praktikant
> >
> > Drooms GmbH
> > Eschersheimer Landstraße 6
> > 60322 Frankfurt, Germany
> > www.drooms.com
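
For anyone hitting the same problem from the Java API rather than the CLI: the alphanumeric optimisation is the third constructor argument of TokenizerFactory, which is then passed to TokenizerME.train. Below is a minimal sketch of that, assuming opennlp-tools is on the classpath; the file name "corpus.wtok.txt" and the sample sentence are taken from the thread, everything else is illustrative.

```java
import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerFactory;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainFrenchTokenizer {
    public static void main(String[] args) throws Exception {
        // Read the whitespace-tokenized training corpus line by line.
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("corpus.wtok.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

        // Third argument is useAlphaNumericOptimization: when true, the
        // trained tokenizer will not propose a split between two
        // characters of the same (alphanumeric) category.
        TokenizerFactory factory = new TokenizerFactory("fra", null, true, null);

        TokenizerModel model = TokenizerME.train(
                samples, factory, TrainingParameters.defaultParams());

        TokenizerME tokenizer = new TokenizerME(model);
        // With the optimisation enabled, "Portsmouth" should stay one token.
        String[] tokens = tokenizer.tokenize("Portsmouth est une ville.");
        for (String t : tokens) {
            System.out.println(t);
        }
    }
}
```

This does programmatically what the corrected CLI call (`-alphaNumOpt true`) does; the model can then be serialized with `model.serialize(...)` to get the same `fr-token.bin` artifact.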