On 03/14/2013 11:44 AM, Andreas Niekler wrote:
What I don't understand is how this produces valid training data, since I just delete the whitespace. You said that I need to include some <SPLIT> tags to have proper training data. Can you please comment on why we have proper training data after detokenizing? I hope it's okay to ask all these questions, but I really want to understand OpenNLP tokenisation.

The training data needs to reflect the data you want to process.
In German (as in English) most tokens are already separated by white space, but punctuation and word tokens are often written together without a separating white space. To encode this latter case in the training data we use the <SPLIT> tag.
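
For example (the sentence is invented), a tokenizer training file contains one sentence per line, tokens separated by white space, with a <SPLIT> tag wherever two tokens touch each other in the original text:

    Das ist ein einfacher Satz<SPLIT>.

Removing the <SPLIT> tags gives back the original surface text "Das ist ein einfacher Satz.", while the tag tells the trainer that "Satz" and "." are separate tokens.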

If you just replace all white spaces in your white space tokenized data with <SPLIT> tags, the input data probably no longer matches the training data. To make the input data match it again you would need to remove all white spaces from it.
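
To illustrate with the same invented sentence: if your data is already white space tokenized as

    Das ist ein einfacher Satz .

and you simply turn every space into <SPLIT>, you get

    Das<SPLIT>ist<SPLIT>ein<SPLIT>einfacher<SPLIT>Satz<SPLIT>.

which describes text with no spaces at all ("DasisteineinfacherSatz."), something the tokenizer will never see at run time. The correct training line only uses <SPLIT> at the positions where the detokenized text has no space:

    Das ist ein einfacher Satz<SPLIT>.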

Can you give us more details about your training data? Is it white space tokenized?
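
In case it helps, here is a minimal sketch of how such a training file is fed to the tokenizer trainer (based on the 1.5.x API; the file names are placeholders, and the exact train(...) signature differs between OpenNLP versions):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    import opennlp.tools.tokenize.TokenSample;
    import opennlp.tools.tokenize.TokenSampleStream;
    import opennlp.tools.tokenize.TokenizerME;
    import opennlp.tools.tokenize.TokenizerModel;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class TrainGermanTokenizer {
        public static void main(String[] args) throws Exception {
            // de-token.train: one sentence per line, tokens separated by
            // white space, <SPLIT> where tokens touch in the original text
            ObjectStream<String> lines = new PlainTextByLineStream(
                new FileInputStream("de-token.train"), "UTF-8");
            ObjectStream<TokenSample> samples = new TokenSampleStream(lines);

            TokenizerModel model = TokenizerME.train("de", samples, true);

            // write the model so it can be loaded into a TokenizerME later
            model.serialize(new FileOutputStream("de-token.bin"));
        }
    }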

Jörn


