Hello, can you tell us a bit more about your training data? Did you manually annotate these 300k sentences? Is it possible to post 10 lines or so here?
Jörn

On 03/12/2013 03:22 PM, Andreas Niekler wrote:
Dear List,

I created a tokenizer model with 300k German sentences from a very clean corpus. A tokenizer using this model separates some words very strangely, for example:

  stehenge - blieben
  fre - undlicher

and so on. I can't find these splits in my training data and wonder why OpenNLP splits these words without any evidence in the training data and without any whitespace in my text files. I trained the model with 500 iterations, cutoff 5 and alphanumeric optimisation.

Can anyone suggest how I can prevent this?

Thank you
Andreas
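
For reference, a minimal training sketch along the lines of what is described above, assuming the OpenNLP 1.5.x API current at the time (file names and the class name are placeholders, not taken from the original mail). The training file is expected in the usual TokenSample format: one sentence per line, tokens separated by whitespace, with <SPLIT> marking token boundaries that have no whitespace in the raw text.

  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.OutputStream;
  import java.nio.charset.Charset;

  import opennlp.tools.tokenize.TokenSample;
  import opennlp.tools.tokenize.TokenSampleStream;
  import opennlp.tools.tokenize.TokenizerME;
  import opennlp.tools.tokenize.TokenizerModel;
  import opennlp.tools.util.ObjectStream;
  import opennlp.tools.util.PlainTextByLineStream;

  public class TrainGermanTokenizer {
      public static void main(String[] args) throws Exception {
          // One sentence per line, e.g.:  Das ist ein Test<SPLIT>.
          ObjectStream<String> lineStream = new PlainTextByLineStream(
                  new FileInputStream("de-tok.train"), Charset.forName("UTF-8"));
          ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

          // 500 iterations, cutoff 5, alphanumeric optimisation enabled,
          // matching the parameters mentioned in the mail.
          TokenizerModel model = TokenizerME.train("de", sampleStream, true, 5, 500);

          OutputStream modelOut = new FileOutputStream("de-token.bin");
          model.serialize(modelOut);
          modelOut.close();
      }
  }

Posting a few training lines in exactly this <SPLIT> format would make it easier to see whether the odd splits come from the data or from the model.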
