Here you may find a reported experience where the author used DBpedia and
Wikipedia for this purpose:

[1] http://blogs.nuxeo.com/dev/2011/01/mining-wikipedia-with-hadoop-and-pig-for-natural-language-processing.html
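Jörn notes below that the name finder works well when trained on your own
data. For completeness, here is a minimal training sketch against the
OpenNLP 1.5-era API (untested; "en-ner-person.train" is a placeholder for
your own annotated corpus, and the exact method signatures may differ
slightly between 1.5.x releases):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Collections;

import opennlp.tools.namefind.NameFinderME;
import opennlp.tools.namefind.NameSample;
import opennlp.tools.namefind.NameSampleDataStream;
import opennlp.tools.namefind.TokenNameFinderModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

public class TrainPersonNameFinder {
  public static void main(String[] args) throws Exception {
    // Training data: one sentence per line, entities marked inline, e.g.
    //   <START:person> Pierre Vinken <END> is chairman of Elsevier .
    ObjectStream<String> lines = new PlainTextByLineStream(
        new FileInputStream("en-ner-person.train"), Charset.forName("UTF-8"));
    ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

    TokenNameFinderModel model;
    try {
      // A null feature generator means default feature generation,
      // the same setup used for most of the distributed models.
      model = NameFinderME.train("en", "person", samples,
          TrainingParameters.defaultParams(),
          (AdaptiveFeatureGenerator) null,
          Collections.<String, Object>emptyMap());
    } finally {
      samples.close();
    }

    // Serialize the model so it can be loaded with NameFinderME later.
    OutputStream modelOut = new FileOutputStream("en-ner-person-custom.bin");
    try {
      model.serialize(modelOut);
    } finally {
      modelOut.close();
    }
  }
}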
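To compare such a model against the scores reported in the manual and in
the wiki test plans, the bundled command-line evaluator can be run on
held-out data, along these lines (a sketch; the file names are
placeholders and the exact flags may vary between 1.5.x releases):

$ opennlp TokenNameFinderEvaluator -model en-ner-person.bin \
    -data en-ner-person.test -encoding UTF-8

It prints precision, recall and F-measure, which are the figures the test
plans record for each release.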
On Tue, Jan 29, 2013 at 4:43 PM, Christian Moen <[email protected]> wrote:
> Hello,
>
> We've done some experiments trying to synthesise a NER corpus from
> Wikipedia using various heuristics and link-structure analyses. However,
> our models didn't turn out very good when scored against a gold standard
> tagged by humans. I'm sure there are many improvements we could consider,
> but we didn't find pursuing this any further all that promising.
> Basically, there were too many issues to consider to make the corpus of
> good quality. I believe academic research in the field faced similar
> challenges. It was quite a fun little study, though.
>
>
> Christian Moen
> アティリカ株式会社
> http://www.atilika.com
>
> On Jan 28, 2013, at 5:20 PM, Svetoslav Marinov
> <[email protected]> wrote:
>
>> Wikipedia is not a good source for training. I've tried that, but not
>> all entities in a text are tagged. Sometimes just the first occurrence
>> of an entity is tagged and the rest are not, or only partially. To me
>> the tagging seemed so random that it does not pass any criteria for a
>> good corpus. And then comes the question of how to distinguish people
>> from places from events or any other entities.
>>
>> For me, in order to use Wikipedia, one will need to do a lot of extra
>> processing before some decent quality is achieved.
>>
>> Svetoslav
>>
>> On 2013-01-28 05:31, "Lance Norskog" <[email protected]> wrote:
>>
>>> Yes. The Wikipedia XML has person/place/etc. tags in all of the
>>> article text.
>>>
>>> On 01/27/2013 08:15 PM, John Stewart wrote:
>>>> Lance, could you say more? Do you mean WP tagging as training data
>>>> for the NER task?
>>>>
>>>> Thanks,
>>>>
>>>> jds
>>>>
>>>> On Sun, Jan 27, 2013 at 11:07 PM, Lance Norskog <[email protected]>
>>>> wrote:
>>>>
>>>>> The Wikipedia tagging should provide very good training sets. Has
>>>>> anybody tried using them?
>>>>>
>>>>> On 01/25/2013 02:14 AM, Jörn Kottmann wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> Well, the main problem with the models on SourceForge is that they
>>>>>> were trained on news data from the 90s and do not perform very well
>>>>>> on today's news articles or on out-of-domain data (anything else).
>>>>>>
>>>>>> When I speak here and there to our users, I always get the
>>>>>> impression that most people are still happy with the performance of
>>>>>> the Tokenizer, Sentence Splitter and POS Tagger; many are
>>>>>> disappointed with the Name Finder models. The name finder works
>>>>>> well, however, if trained on your own data.
>>>>>>
>>>>>> Maybe the OntoNotes Corpus is something worth looking into.
>>>>>>
>>>>>> The licensing is a gray area; you can probably get away with using
>>>>>> the models in commercial software. The corpus producers often
>>>>>> restrict the usage of their corpus to research purposes only. The
>>>>>> question is whether they can enforce these restrictive terms on
>>>>>> statistical models built on the data, since the models probably
>>>>>> don't violate the copyright. Sorry for not having a better answer;
>>>>>> you probably need to ask a lawyer.
>>>>>>
>>>>>> The evaluations in the documentation are often just samples to
>>>>>> illustrate how to use the tools. Have a look at the test plans in
>>>>>> our wiki; we record the performance of OpenNLP there for every
>>>>>> release we make.
>>>>>>
>>>>>> The models are mostly trained with default feature generation; have
>>>>>> a look at the documentation and our code to get more details about
>>>>>> it. The features are not yet well documented, but a documentation
>>>>>> patch to fix this would be very welcome!
>>>>>>
>>>>>> HTH,
>>>>>> Jörn
>>>>>>
>>>>>> On 01/25/2013 10:36 AM, Christian Moen wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm exploring the possibility of using OpenNLP in commercial
>>>>>>> software. As part of this, I'd like to assess the quality of some
>>>>>>> of the models available on
>>>>>>> http://opennlp.sourceforge.net/models-1.5/
>>>>>>> and also learn more about the applicable license terms.
>>>>>>>
>>>>>>> My primary interest for now is the English models for Tokenizer,
>>>>>>> Sentence Detector and POS Tagger.
>>>>>>>
>>>>>>> The documentation on
>>>>>>> http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html
>>>>>>> provides scores for various models as part of evaluation run
>>>>>>> examples. Do these scores generally reflect those of the models on
>>>>>>> the SourceForge download page? Are further details on model
>>>>>>> quality, source corpora, features used, etc. available?
>>>>>>>
>>>>>>> I've seen posts to this list explain, as a general comment, that
>>>>>>> "the models are subject to the licensing restrictions of the
>>>>>>> copyright holders of the corpus used to train them". I understand
>>>>>>> that the models on SourceForge aren't part of any Apache OpenNLP
>>>>>>> release, but I'd very much appreciate it if someone in the know
>>>>>>> could provide further insight into the applicable licensing terms.
>>>>>>> I'd be glad to be wrong about this, but my understanding is that
>>>>>>> the models can't be used commercially.
>>>>>>>
>>>>>>> Many thanks for any insight.
>>>>>>>
>>>>>>>
>>>>>>> Christian

--
Dr. Nicolas Hernandez
Associate Professor (Maître de Conférences)
Université de Nantes - LINA CNRS UMR 6241
http://enicolashernandez.blogspot.com
http://www.univ-nantes.fr/hernandez-n
+33 (0)2 51 12 53 94
+33 (0)2 40 30 60 67
