On 13.01.2017, at 11:15, Peter Klügl <peter.klu...@averbis.com> wrote:
>
> On 13.01.2017 at 08:19, Richard Eckart de Castilho wrote:
>> ...
>>
>> In theory there is also a trainer for the tokenizer, but I haven't been able
>> yet to set up a working unit test for it. I think that was due to an
>> immediate lack of suitable training data. So it remains on the todo list.
>>
>
> we have several OpenNLP tokenizer models. Aren't most corpora, e.g.,
> annotated with POS tags, suitable?

I think the problem was that the data I had readily available was in a CoNLL format. You cannot train a tokenizer from most CoNLL formats because they carry no information on whether two tokens are directly adjacent or separated by whitespace.

Do you have a suggestion for a publicly available corpus that contains offset information and would be suitable?

Cheers,

-- Richard
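
P.S. To illustrate the adjacency problem, here is a minimal sketch, not the actual trainer code. It assumes the OpenNLP tokenizer training format (one sentence per line, tokens separated either by whitespace or by a `<SPLIT>` marker where no whitespace exists in the original text). The function name `to_opennlp_line`, the sample sentence, and the offsets are all made up for illustration.

```python
SPLIT = "<SPLIT>"


def to_opennlp_line(text, spans):
    """Render one sentence as a tokenizer training line.

    `spans` is a list of (begin, end) character offsets, sorted and
    non-overlapping. Assumption: any gap between two tokens is whitespace.
    """
    parts = []
    prev_end = None
    for begin, end in spans:
        if prev_end is not None:
            # Directly adjacent tokens get <SPLIT>; separated tokens get a space.
            parts.append(SPLIT if begin == prev_end else " ")
        parts.append(text[begin:end])
        prev_end = end
    return "".join(parts)


# Hypothetical sentence with token offsets, as an offset-aware corpus would give.
text = "Hello, world!"
spans = [(0, 5), (5, 6), (7, 12), (12, 13)]

line = to_opennlp_line(text, spans)
print(line)

# What a typical CoNLL file keeps: the token strings only, one per row.
# Joining them cannot tell "Hello," apart from "Hello ,".
conll_tokens = [text[b:e] for b, e in spans]
print(" ".join(conll_tokens))
```

With offsets, the comma and the exclamation mark are marked as attached to the preceding word; with the token list alone, that distinction is gone, which is why such data cannot be used to train a tokenizer.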