Hi,

It all depends on what you want to use OpenNLP for. If you want to do basic research, you will still need to acquire the standard corpora for a given task (e.g., the Penn Treebank for English parsing) so that you can compare with previous approaches and so that your results are publishable.
Cheers,
Rodrigo

On Tue, Oct 1, 2013 at 6:02 PM, Jörn Kottmann <[email protected]> wrote:

> On 10/01/2013 05:36 PM, Ryan Josal wrote:
>>
>> That is what I'm doing. I've set up semaphore pools for all my
>> TokenNameFinders. I do wonder whether there are any technical concessions
>> one would have to make to get a thread-safe TokenNameFinder. What would
>> happen to the adaptive data? On the topic of models, the SourceForge ones
>> have certainly been useful; I'm mainly using the NER models, but indeed
>> more models, or models trained on more recent data, would be nice. But I
>> know training data, even without annotations, doesn't come out of thin
>> air, otherwise I'd have created a few models myself.
>
> If there is interest and there are contributors, it would be possible to
> label wikinews data (we worked a bit on that), but there are certainly
> more sources of documents which could be obtained under an
> Apache-compatible license.
>
> Anyway, I guess the process to create training data as part of the
> OpenNLP project would be roughly as follows:
> - Obtain some raw text
> - Write an annotation guide (maybe based on some existing ones)
> - Agree on an annotation tool to use (e.g., brat)
> - Annotate a few hundred documents
> - Make the first release of the corpus
>
> Jörn
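On the thread-safety point above: NameFinderME keeps adaptive data between calls, which is why instances cannot simply be shared across threads. A minimal sketch of the pooling approach Ryan describes, assuming OpenNLP's NameFinderME/TokenNameFinderModel API; the class name NameFinderPool and the pool size are illustrative, not part of OpenNLP:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.Span;

// A fixed-size pool of NameFinderME instances. NameFinderME is not
// thread-safe because it accumulates adaptive data across sentences,
// so each thread borrows a whole instance instead of sharing one.
public class NameFinderPool {

    private final BlockingQueue<NameFinderME> pool;

    public NameFinderPool(TokenNameFinderModel model, int size) {
        pool = new ArrayBlockingQueue<NameFinderME>(size);
        for (int i = 0; i < size; i++) {
            // Each instance carries its own adaptive data.
            pool.add(new NameFinderME(model));
        }
    }

    public static NameFinderPool fromModelFile(String path, int size) throws IOException {
        InputStream in = new FileInputStream(path);
        try {
            return new NameFinderPool(new TokenNameFinderModel(in), size);
        } finally {
            in.close();
        }
    }

    // Tags one document, given as an array of tokenized sentences.
    public List<Span[]> findNames(String[][] sentences) throws InterruptedException {
        NameFinderME finder = pool.take(); // blocks until an instance is free
        try {
            List<Span[]> names = new ArrayList<Span[]>();
            for (String[] tokens : sentences) {
                names.add(finder.find(tokens));
            }
            // Reset adaptive data at the document boundary so state from
            // this document does not influence the next one.
            finder.clearAdaptiveData();
            return names;
        } finally {
            pool.put(finder); // always return the instance to the pool
        }
    }
}

The BlockingQueue plays the role of the semaphore: at most `size` threads run a finder at once, and no instance is ever used by two threads concurrently.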

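On the corpus pipeline Jörn sketches: brat writes standoff annotations (.ann files alongside the raw text), while OpenNLP's name finder trains on one whitespace-tokenized sentence per line with entities marked inline, so a small conversion step would be needed before a release. The sentences and entity types below are only illustrative of the target training format:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of <START:organization> Elsevier N.V. <END> , the Dutch publishing group .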