On 13.01.2017, at 11:15, Peter Klügl <peter.klu...@averbis.com> wrote:
> 
> On 13.01.2017 at 08:19, Richard Eckart de Castilho wrote:
>> ...
>> 
>> In theory there is also a trainer for the tokenizer, but I haven't yet been 
>> able to set up a working unit test for it. I think that was due to an 
>> immediate lack of suitable training data. So it remains on the todo list.
>> 
> 
> We have several OpenNLP tokenizer models. Aren't most corpora, e.g. those
> annotated with POS tags, suitable?

I think the problem was that the data I had easily available was in a CoNLL 
format. You cannot train a tokenizer from most CoNLL formats because they do 
not record whether two tokens are directly adjacent or separated by whitespace.
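
To illustrate what is missing, here is a minimal sketch in Python. It assumes 
character offsets are available for every token and renders a sentence in the 
style of OpenNLP tokenizer training data, where a <SPLIT> marker flags token 
boundaries that have no whitespace. The function name and the sample sentence 
are made up for illustration; this is not code from the project.

```python
def to_split_format(text, spans):
    """Render one sentence as a <SPLIT>-style tokenizer training line.

    text  -- the original sentence
    spans -- (begin, end) character offsets, one pair per token, in order
    """
    parts = []
    prev_end = None
    for begin, end in spans:
        if prev_end is not None:
            # Directly adjacent tokens get an explicit marker; otherwise a
            # single space stands in for the whitespace in the source text.
            parts.append("<SPLIT>" if begin == prev_end else " ")
        parts.append(text[begin:end])
        prev_end = end
    return "".join(parts)


text = "Hello, world!"
spans = [(0, 5), (5, 6), (7, 12), (12, 13)]
print(to_split_format(text, spans))  # Hello<SPLIT>, world<SPLIT>!

# A plain token column keeps only the strings below. The same list would
# also result from "Hello , world !", so adjacency cannot be recovered.
tokens = [text[b:e] for b, e in spans]
```

With only the token column, both <SPLIT> decisions above would have to be 
guessed, and those decisions are exactly what the tokenizer has to learn.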

Do you have a suggestion for a publicly available corpus that contains offset 
information and which would be suitable?

Cheers,

-- Richard
