Hello,

can you tell us a bit more about your training data? Did you manually
annotate these 300k sentences?
Is it possible to post 10 lines or so here?
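
For reference, the format the OpenNLP TokenizerTrainer expects is one
sentence per line, with tokens separated either by whitespace or by a
<SPLIT> tag where there is no whitespace in the original text. A made-up
example line (not from your corpus) would look like this:

    Das Treffen beginnt am Montag<SPLIT>, den 29. April<SPLIT>.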

Jörn

On 03/12/2013 03:22 PM, Andreas Niekler wrote:
Dear List,

I created a tokenizer model with 300k German sentences from a very clean
corpus. I see some words that are strangely separated by a tokenizer
using this model, like:

stehenge - blieben
fre - undlicher

and so on. I can't find those in my training data and wonder why OpenNLP
splits these words without any evidence in the training data and without
any whitespace in my text files. I trained the model with 500
iterations, cutoff 5, and alphanumeric optimisation.
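
For context, a minimal sketch of how such a model can be trained with the
OpenNLP 1.5 API (the file names, charset, and language code below are
placeholders, not my exact setup):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizerSketch {
  public static void main(String[] args) throws Exception {
    // Training file (placeholder name): one sentence per line, tokens
    // separated by whitespace or by <SPLIT> tags.
    ObjectStream<String> lines = new PlainTextByLineStream(
        new FileInputStream("de-tok.train"), Charset.forName("UTF-8"));
    ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

    // 500 iterations, cutoff 5
    TrainingParameters params = TrainingParameters.defaultParams();
    params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(500));
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));

    // The boolean argument enables the alphanumeric optimisation.
    TokenizerModel model = TokenizerME.train("de", samples, true, params);

    OutputStream out = new FileOutputStream("de-token.bin");
    model.serialize(out);
    out.close();
    samples.close();
  }
}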

Can anyone suggest some ideas on how I can prevent this?

Thank you

Andreas
