Hello,

can you tell us a bit more about your training data? Did you manually
annotate these 300k sentences?
Is it possible to post 10 lines or so here?
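
For reference, the format the OpenNLP TokenizerTrainer expects is one
sentence per line, with tokens separated either by whitespace or by a
<SPLIT> tag where there is no whitespace in the original text. A made-up
example line (not from your corpus) would look like this:

    Das Treffen beginnt am Montag<SPLIT>, den 29. April<SPLIT>.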

Jörn

On 03/12/2013 03:22 PM, Andreas Niekler wrote:
Dear List,

I created a tokenizer model with 300k German sentences from a very clean
corpus. I see some words that are strangely separated by a tokenizer
using this model, like:

stehenge - blieben
fre - undlicher

and so on. I can't find those in my training data and wonder why OpenNLP
splits these words without any evidence in the training data and without
any whitespace in my text files. I trained the model with 500
iterations, cutoff 5, and alphanumeric optimisation.
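
For context, a minimal sketch of how such a model can be trained with the
OpenNLP 1.5 API (the file names, charset, and language code below are
placeholders, not my exact setup):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainTokenizerSketch {
  public static void main(String[] args) throws Exception {
    // Training file (placeholder name): one sentence per line, tokens
    // separated by whitespace or by <SPLIT> tags.
    ObjectStream<String> lines = new PlainTextByLineStream(
        new FileInputStream("de-tok.train"), Charset.forName("UTF-8"));
    ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

    // 500 iterations, cutoff 5
    TrainingParameters params = TrainingParameters.defaultParams();
    params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(500));
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(5));

    // The boolean argument enables the alphanumeric optimisation.
    TokenizerModel model = TokenizerME.train("de", samples, true, params);

    OutputStream out = new FileOutputStream("de-token.bin");
    model.serialize(out);
    out.close();
    samples.close();
  }
}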

Can anyone suggest some ideas on how I can prevent this?

Thank you

Andreas
