Very useful, thank you. The only question I have left, for the moment, is about performance. The minimum recommended number of sentences is 15,000. Does anyone know how far this would need to be increased before it became a performance issue, if it ever would? For example, if I created training data with 100,000 sentences, would that be an issue? Is there any number at which it would become one?
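To make clear what I mean by a "sentence" of training data, here is a minimal sketch of how one such line could be built. It assumes the <START:type> ... <END> format shown in the manual for person also applies to organization and location. The class NameSampleLine, its Span type and the second example sentence are made up for illustration and are not part of OpenNLP.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT part of OpenNLP: builds one line of name finder
// training data from whitespace-separated tokens and labelled token spans.
final class NameSampleLine {

    // A labelled span over tokens [start, end), e.g. type "person".
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Wraps each span in <START:type> ... <END>, one sentence per line.
    // Assumption: spans do not overlap and lie within the token array.
    static String annotate(String[] tokens, List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            for (Span s : spans) {
                if (s.start == i) {
                    sb.append("<START:").append(s.type).append("> ");
                }
            }
            sb.append(tokens[i]);
            for (Span s : spans) {
                if (s.end == i + 1) {
                    sb.append(" <END>");
                }
            }
            if (i < tokens.length - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sentence from the OpenNLP manual.
        String[] first = "Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 ."
                .split(" ");
        System.out.println(annotate(first, Arrays.asList(new Span(0, 2, "person"))));

        // Made-up sentence mixing two entity types in one training line.
        String[] second = "Acme Corp opened an office in Glasgow .".split(" ");
        System.out.println(annotate(second, Arrays.asList(
                new Span(0, 2, "organization"),
                new Span(6, 7, "location"))));
    }
}
```

Under those assumptions the second call yields "<START:organization> Acme Corp <END> opened an office in <START:location> Glasgow <END> .", so 100,000 sentences would mean 100,000 lines of this shape in a single training file.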
Thanks,
Robert

> Subject: Re: Name finder questions
> To: users@opennlp.apache.org
> From: p...@thomas-zastrow.de
> Date: Fri, 22 Apr 2016 10:22:50 +0200
>
> Here you can find raw data I used to create a German model, maybe it's
> useful for you:
>
> http://www.thomas-zastrow.de/nlp/
>
> ("Raw trainingdata in OpenNLP format")
>
>
> On 22.04.2016 at 10:17, Robert Logue wrote:
> > Can anyone help here? I don't want to start creating a large training file
> > and find out I have gone about it in the wrong way.
> >
> > The resources I have been looking at are
> >
> > https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training
> > http://blog.thedigitalgroup.com/sagarg/2015/10/30/open-nlp-name-finder-model-training/
> > http://nishutayaltech.blogspot.co.uk/2015/07/writing-custom-namefinder-model-in.html
> >
> > None of which gives the answers I am looking for.
> >
> > Thanks,
> >
> > Robert
> >
> >> From: rplo...@hotmail.co.uk
> >> To: users@opennlp.apache.org
> >> Subject: RE: Name finder questions
> >> Date: Wed, 20 Apr 2016 09:51:25 +0100
> >>
> >> I have a few questions regarding creating my own training data for the
> >> name finder. I would like to distinguish between people, organizations and
> >> locations. The example in the documentation shows the tags to use for
> >> people, i.e.
> >>
> >> <START:person> Pierre Vinken <END> , 61 years old , will join the board as
> >> a nonexecutive director Nov. 29 .
> >>
> >> So would I use <START:organization><END> and <START:location><END> for
> >> organizations and locations respectively? The named entity guidelines in
> >> the documentation, i.e.
> >>
> >> https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.annotation_guides
> >>
> >> seem to show different tags being used, which has confused me slightly as
> >> to which tags I should actually use.
> >>
> >> Also, I see the 15,000 line recommendation. Is there any performance hit if
> >> you use many more lines?
> >>
> >> If I create my plain text training file as I outlined above, are there any
> >> other params that are recommended beyond the basic ones, i.e.
> >>
> >> opennlp TokenNameFinderTrainer -model OUTPUT_FILE.bin -lang en -data
> >> TRAINING_FILE.train -encoding UTF-8
> >>
> >> For instance, what is the -params training parameters file used for? Is
> >> it necessary? Should it list the named entities I am looking for, i.e.
> >> person, organization and location? If so, what format should it be in?
> >>
> >> Sorry for the basic questions here but I can't find the answers in the
> >> documentation or from a quick google.
> >>
> >> Thanks,
> >>
> >> Robert
> >>
> >>
> >>> From: rodrigo.age...@ehu.eus
> >>> Date: Mon, 18 Apr 2016 09:36:24 +0200
> >>> Subject: Re: Name finder questions
> >>> To: users@opennlp.apache.org
> >>>
> >>> Hello,
> >>>
> >>> Yes, that is the idea.
> >>>
> >>> R
> >>>
> >>> On Sun, Apr 17, 2016 at 9:10 PM, Robert Logue <rplo...@hotmail.co.uk>
> >>> wrote:
> >>>> I am slightly confused about what I can use the data in those links for. So
> >>>> can I use this data with the training tool like the following
> >>>>
> >>>> opennlp TokenNameFinderTrainer -model OUTPUT_FILE_NAME -lang en
> >>>> -data DOWNLOADED_FILE_NAME -encoding UTF-8
> >>>> And that should give me a better model file for when I use the name
> >>>> finder?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Robert
> >>>>
> >>>>> From: rodrigo.age...@ehu.eus
> >>>>> Date: Fri, 15 Apr 2016 17:12:20 +0200
> >>>>> Subject: Re: Name finder questions
> >>>>> To: users@opennlp.apache.org
> >>>>>
> >>>>> Hi Robert,
> >>>>>
> >>>>> On Fri, Apr 15, 2016 at 10:25 AM, Robert Logue <rplo...@hotmail.co.uk>
> >>>>> wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> I have just started using OpenNLP in a Java application. I am just
> >>>>>> getting used to the software and have a couple of newbie
> >>>>>> questions.
> >>>>>>
> >>>>>> I see for the name finder there is different model data for people and
> >>>>>> organizations (en-ner-organization.bin and en-ner-person.bin). Is
> >>>>>> there any way to combine these into one file so I can do one search that
> >>>>>> will give me back person names and organization names? Or is this not
> >>>>>> possible and is it best to do two searches?
> >>>>> This used to be experimental. It is not anymore; you can now train
> >>>>> a name finder model for more than one entity type. The models
> >>>>> available were trained with rather old newswire data so I would
> >>>>> recommend that you train new models using OpenNLP:
> >>>>>
> >>>>> http://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.tool
> >>>>>
> >>>>> I suppose you do not have manually annotated training data so I would
> >>>>> recommend getting the Ontonotes corpus.
> >>>>>
> >>>>> https://catalog.ldc.upenn.edu/LDC2013T19
> >>>>>
> >>>>> https://github.com/ontonotes/conll-formatted-ontonotes-5.0
> >>>>>
> >>>>> Another option is to get a silver standard corpus obtained
> >>>>> automatically from Wikipedia:
> >>>>>
> >>>>> http://schwa.org/projects/resources/wiki/Wikiner#Automatic-training-data-from-Wikipedia
> >>>>>
> >>>>> For Dutch, Spanish, German and Italian (that I know of) there are free
> >>>>> resources. Search for Ancora, SONAR-1, GermEval 2014 and Evalita 2009.
> >>>>>
> >>>>>> This question isn't related to the name finder and I don't think it is
> >>>>>> possible but thought I would ask anyway. If I had two sentences, say
> >>>>>> 'Jack climbed the hill. He was very tired.', is there any way to know
> >>>>>> that the pronoun, he, at the start of the second sentence is actually
> >>>>>> about Jack, the subject of the first sentence? I know in this simple
> >>>>>> case it is obvious but I am wondering if there is anything in the
> >>>>>> OpenNLP software that will help with this?
> >>>>> The example you mentioned is called "pronominal anaphora" and it
> >>>>> generalizes to the coreference resolution problem. There used to be a
> >>>>> coreference tool in OpenNLP but it got moved to the Sandbox because many
> >>>>> things need to be updated to be able to distribute it.
> >>>>>
> >>>>> See http://conll.cemantix.org/2012/introduction.html for more details.
> >>>>>
> >>>>> HTH,
> >>>>>
> >>>>> R
> >>
> >
>
> --
> Dr. Thomas Zastrow
> Rechenzentrum Garching (RZG) der Max-Planck-Gesellschaft (MPG)
> Gießenbachstr. 2, D-85748 Garching bei München, Germany
> Tel +49-89-3299-1457
> http://www.rzg.mpg.de
>