On 13.01.2017 at 21:12, Richard Eckart de Castilho wrote:
...
> I think the problem was that the data I had easily available was in a CoNLL
> format - you cannot train a tokenizer from most CoNLL formats because there is
> no information about whether two tokens are directly adjacent or not.
> Do you have a suggestion for a publicly available corpus that contains offset
> information and which would be suitable?
I do not recall the exact licenses and their implications right now, but
Genia [1] or English Universal Dependencies [2], for example, should do
the trick (with some conversion). Genia contains inline XML tags for
words/tokens, and the English UD data records where no space follows a
token (SpaceAfter=No in the MISC column).
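
To illustrate the "some converting" part for UD, here is a rough sketch of how
character offsets could be recovered from CoNLL-U. It assumes the standard ten
tab-separated columns and the SpaceAfter=No flag in the MISC column; the
function name and the sample sentence are made up for illustration, and a
single space is assumed wherever the flag is absent.

```python
def tokens_with_offsets(conllu_sentence):
    """Rebuild the surface text of one CoNLL-U sentence.

    Returns (text, spans), where spans is a list of (form, begin, end).
    Assumption: a single space separates tokens unless MISC has SpaceAfter=No.
    """
    text = ""
    spans = []
    skip_until = 0  # last word ID covered by a multiword token range
    for line in conllu_sentence.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or sentence-level comment
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:
            continue  # empty node, has no surface form
        if "-" in tok_id:
            # multiword token: use its surface form, skip the words it covers
            skip_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= skip_until:
            continue
        begin = len(text)
        text += form
        spans.append((form, begin, len(text)))
        if "SpaceAfter=No" not in misc.split("|"):
            text += " "
    return text.rstrip(" "), spans


# Hypothetical sample sentence, not taken from the UD_English data.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["1", "Hello", "hello", "INTJ", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["2", ",", ",", "PUNCT", "_", "_", "1", "punct", "_", "_"],
    ["3", "world", "world", "NOUN", "_", "_", "1", "vocative", "_", "SpaceAfter=No"],
    ["4", "!", "!", "PUNCT", "_", "_", "1", "punct", "_", "_"],
])

if __name__ == "__main__":
    text, spans = tokens_with_offsets(SAMPLE)
    print(text)
    for form, begin, end in spans:
        print(begin, end, form)
```

The resulting text plus token spans is the kind of input a tokenizer trainer
needs; how it has to be serialized depends on the trainer being used.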
Best,
Peter
[1] http://www.geniaproject.org/genia-corpus/pos-annotation
[2] https://github.com/UniversalDependencies/UD_English