> On 01 Oct 2013, at 18:02, Jörn Kottmann <[email protected]> wrote:
>
>> On 10/01/2013 05:36 PM, Ryan Josal wrote:
>> That is what I'm doing. I've set up semaphore pools for all my
>> TokenNameFinders. I would wonder whether there are any technical
>> concessions one would have to make to get a TokenNameFinder thread
>> safe. What would happen to the adaptive data? On the topic of models,
>> the SourceForge ones have certainly been useful; I'm mainly using the
>> NER models, but indeed more models, or models trained on more recent
>> data, would be nice. But I know training data, even without
>> annotations, doesn't come out of thin air, otherwise I'd have created
>> a few models myself.
>
> If there is interest and there are contributors, it would be possible
> to label wikinews data (we worked a bit on that), but surely there are
> more sources of documents which could be obtained with an
> Apache-compatible license.
>
> Anyway, I guess the process to create training data as part of the
> OpenNLP project would be roughly as follows:
> - Obtain some raw text
> - Write an annotation guide (maybe based on some existing ones)
> - Agree on an annotation tool to use (e.g. brat)
> - Annotate a few hundred documents
> - Make the first release of the corpus
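A quick aside on the thread-safety point above, mostly to check my own understanding: I picture the semaphore-pool approach roughly like the sketch below. The OpenNLP calls (TokenNameFinderModel, TokenNameFinderME.find, clearAdaptiveData) are the 1.5.x ones as far as I know, but the class name, pool size and model path are invented and the code is untested, so please read it as a sketch rather than a recommendation.

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import opennlp.tools.namefind.TokenNameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

public class PooledNameFinder {

    // each TokenNameFinderME keeps its own adaptive data, so instances are
    // handed out to one caller at a time; the queue plays the semaphore role
    private final BlockingQueue<TokenNameFinderME> pool;

    public PooledNameFinder(String modelPath, int poolSize) throws IOException {
        InputStream in = new FileInputStream(modelPath);
        TokenNameFinderModel model;
        try {
            model = new TokenNameFinderModel(in); // the model object itself can be shared
        } finally {
            in.close();
        }
        pool = new ArrayBlockingQueue<TokenNameFinderME>(poolSize);
        for (int i = 0; i < poolSize; i++) {
            pool.add(new TokenNameFinderME(model));
        }
    }

    public Span[] find(String[] tokens) throws InterruptedException {
        TokenNameFinderME finder = pool.take(); // blocks like acquiring a permit
        try {
            return finder.find(tokens);
        } finally {
            pool.put(finder); // release the instance back to the pool
        }
    }

    // call between documents so adaptive data does not leak across them;
    // assumes no find() calls are in flight at that moment
    public void clearAdaptiveData() {
        List<TokenNameFinderME> finders = new ArrayList<TokenNameFinderME>();
        pool.drainTo(finders);
        for (TokenNameFinderME finder : finders) {
            finder.clearAdaptiveData();
        }
        pool.addAll(finders);
    }
}

If that reading is right, the model can be shared freely and only the TokenNameFinderME instances need to be confined to one caller at a time.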
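And regarding the annotation steps above: assuming the released corpus ends up in the native OpenNLP name finder training format, I picture one whitespace-tokenized sentence per line, entities wrapped in <START:type> ... <END> tags, and empty lines between documents, something like the two made-up Italian lines below (names and entity types chosen only for illustration):

<START:person> Giorgio Valoti <END> vive a <START:location> Roma <END> .
La <START:organization> Apache Software Foundation <END> pubblica OpenNLP .

If brat or GATE is used for the actual annotation, I assume there would be a conversion step from their own formats to something like this.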
I'd be interested in creating an annotated corpus for Italian, or at least in beginning the process. The first problem for me is finding an annotation guide. Does anyone have some links?

Re: brat. I have a little familiarity with GATE. How do they compare? Are they even comparable?

Thanks
--
Giorgio Valoti
