Re: Anyone using the UIMA trainer AEs?

Peter Klügl Fri, 13 Jan 2017 14:34:56 -0800


On 13.01.2017 at 21:12, Richard Eckart de Castilho wrote:

...
I think the problem was that the data I had easily available was in a CoNLL 
format - you cannot train a tokenizer from most CoNLL formats because there is 
no information whether two tokens are directly adjacent or not.


Do you have a suggestion for a publicly available corpus that contains offset 
information and which would be suitable?

I do not recall the exact licenses and their implications right now, but Genia [1] or English Universal Dependencies [2], for example, should do the trick (with some conversion). Genia contains inline XML tags for words/tokens, and the English UD contains information about the spaces.
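
To illustrate the kind of conversion meant here, the following is a minimal Python sketch of how character offsets might be recovered from UD data. It assumes the standard CoNLL-U layout (ten tab-separated columns, with `SpaceAfter=No` in the MISC column when a token is directly followed by the next one). The function name `conllu_to_offsets` is made up for this example, and multiword-token ranges and empty nodes are simply skipped, which is a simplification.

```python
def conllu_to_offsets(lines):
    """Rebuild sentence text and (begin, end, form) spans from CoNLL-U lines.

    Hypothetical helper for illustration only; assumes one sentence per call.
    """
    text = ""
    spans = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines such as "# text = ..."
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tok_id, form, misc = cols[0], cols[1], cols[9]
        # Simplification: ignore multiword ranges ("1-2") and empty nodes ("1.1")
        if "-" in tok_id or "." in tok_id:
            continue
        begin = len(text)
        text += form
        spans.append((begin, len(text), form))
        # A token is followed by a space unless MISC says otherwise
        if "SpaceAfter=No" not in misc.split("|"):
            text += " "
    return text.rstrip(" "), spans


sample = [
    "# text = Hello, world!",
    "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\tSpaceAfter=No",
    "2\t,\t,\tPUNCT\t,\t_\t1\tpunct\t_\t_",
    "3\tworld\tworld\tNOUN\tNN\t_\t1\tvocative\t_\tSpaceAfter=No",
    "4\t!\t!\tPUNCT\t.\t_\t1\tpunct\t_\t_",
]
text, spans = conllu_to_offsets(sample)
print(text)
print(spans)
```

The resulting text plus token spans is the information a tokenizer trainer needs and that most other CoNLL variants do not provide.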


Best,

Peter

[1] http://www.geniaproject.org/genia-corpus/pos-annotation
[2] https://github.com/UniversalDependencies/UD_English
