Hello,

In any case, I think it's a bit old-school to separate tokens and their annotations with nothing but spaces ... what about a nice XML format (no, not that ISO crap ... what about TCF [1])? Or maybe NEGRA?
Best,

Tom

[1] http://weblicht.sfs.uni-tuebingen.de/weblichtwiki/index.php/The_TCF_Format

On 14.10.2013 21:53, Charles Martin wrote:
> What happens if all the entity tokens are at the beginning of every line?
> I find that OpenNLP then thinks that any string near the beginning of a line
> is an entity, regardless of the content or word context.
>
> On Mon, Oct 14, 2013 at 12:48 PM, Thomas Zastrow <[email protected]> wrote:
>
>> Thanks. That explains a lot ... :-)
>>
>> Does it matter whether it is one or two blanks?
>>
>> On 14.10.2013 21:44, William Colen wrote:
>>> Yes, it does. Include a blank between all elements, including punctuation
>>> and annotations. The corpus must be tokenized.
>>>
>>> 2013/10/14 Thomas Zastrow <[email protected]>
>>>
>>>> Hello,
>>>>
>>>> I have a question: when creating training material, does it make a
>>>> difference if there are " " (blanks) around the NE? In other words, is
>>>> it the same to have:
>>>>
>>>> <START:loc>Hamburg<END>
>>>>
>>>> or:
>>>>
>>>> <START:loc> Hamburg <END>
>>>>
>>>> The example in the documentation shows it with the " " ... ?
>>>>
>>>> Best,
>>>>
>>>> Tom
>>>>
>>>> P.S.: ca. 1300 sentences for a free German NE model are done :-)
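
To make the spacing rule from the quoted thread concrete, here is a minimal Python sketch that rewrites a training line so every <START:type> and <END> tag has a blank on both sides. The helper name normalize_line is hypothetical and not part of OpenNLP. Collapsing repeated blanks to a single one is an assumption made for tidiness; the thread does not say whether one or two blanks makes a difference. The sketch does not tokenize the rest of the sentence.

```python
import re

# Matches an annotation tag such as <START:loc>, <START> or <END>,
# together with any whitespace that already surrounds it.
TAG = re.compile(r"\s*(<START(?::[^>\s]+)?>|<END>)\s*")


def normalize_line(line: str) -> str:
    """Return the line with a single blank around each annotation tag.

    Hypothetical helper for illustration only; it is not an OpenNLP API.
    """
    # Re-insert every tag with a blank on both sides.
    spaced = TAG.sub(r" \1 ", line)
    # Assumption: collapse runs of whitespace to one blank and trim the ends.
    return " ".join(spaced.split())


if __name__ == "__main__":
    print(normalize_line("Ich wohne in <START:loc>Hamburg<END> ."))
```

Running a whole training file through such a function before training would at least rule out the <START:loc>Hamburg<END> variant from the original question.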
