On 13.01.2017 at 21:12, Richard Eckart de Castilho wrote:
...
I think the problem was that the data I had easily available was in a CoNLL 
format - you cannot train a tokenizer from most CoNLL formats because there is 
no information whether two tokens are directly adjacent or not.

Do you have a suggestion for a publicly available corpus that contains offset 
information and which would be suitable?


I do not recall the exact licenses and their implications right now, but Genia [1] or the English Universal Dependencies treebank [2], for example, should do the trick (with some conversion). Genia contains inline XML tags for words/tokens, and the English UD data records where spaces occur between tokens.
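
A minimal sketch of what the conversion could look like for the UD case, assuming the treebank is in CoNLL-U format and marks a missing space with `SpaceAfter=No` in the last (MISC) column. The function name `tokens_with_offsets` and the sample sentence are made up for illustration and are not taken from either corpus:

```python
def tokens_with_offsets(conllu_sentence):
    """Rebuild the surface text and (start, end) token offsets for one
    CoNLL-U sentence, using SpaceAfter=No in the MISC column."""
    text = ""
    spans = []
    for line in conllu_sentence.splitlines():
        # Skip blank lines and sentence-level comments such as "# text = ...".
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Simplification: ignore multiword-token ranges ("1-2") and empty
        # nodes ("1.1"). A real converter would have to handle them.
        if not cols[0].isdigit():
            continue
        form, misc = cols[1], cols[9]
        start = len(text)
        text += form
        spans.append((start, len(text)))
        # A token is followed by a space unless MISC says otherwise.
        if "SpaceAfter=No" not in misc.split("|"):
            text += " "
    return text.rstrip(), spans


# Hypothetical sample sentence, hand-written for this sketch.
sample = "\n".join([
    "# text = Hello, world!",
    "1\tHello\thello\tINTJ\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "2\t,\t,\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "3\tworld\tworld\tNOUN\t_\t_\t1\tvocative\t_\tSpaceAfter=No",
    "4\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_",
])

text, spans = tokens_with_offsets(sample)
```

The resulting text and spans carry the adjacency information that most other CoNLL variants lack, and they could then be written out in whatever format the tokenizer trainer expects.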

Best,

Peter

[1] http://www.geniaproject.org/genia-corpus/pos-annotation
[2] https://github.com/UniversalDependencies/UD_English
