Dear Jörn,

Thank you very much. I still have not really understood what these generators are, but I will dive into it :-)

I'm aware of the CoNLL data, but because the (German) data is not freely available, I thought it wouldn't be a bad idea to have at least a small but freely available trained NE model.

Best,

Tom

On 10.10.2013 14:39, Jörn Kottmann wrote:
> On 10/10/2013 11:58 AM, Thomas Zastrow wrote:
>> Hello,
>>
>> There seems to be no free German NE model available, so I started to
>> think about creating one - just using free resources like Wikipedia etc.
>>
>> I still have some questions:
>>
>> Somewhere in the documentation, I read about a dictionary driven NE
>> recognizer in OpenNLP. But I didn't find any further information
>> about it. Anyway, would it be possible to combine the statistical
>> approach with dictionaries? For example, having a list of country
>> names would be useful.
>>
>
> Yes, that is possible. We have a DictionaryFeatureGenerator which can
> look up names in a dictionary and produce features for them.
> There is an XML file you can create to describe how the feature
> generation should be set up for training; the file is then stored in
> the model to be able to reproduce the exact same feature generation
> when the model is loaded later.
>
> See our documentation:
> http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.featuregen
>
>
> What are the features you would like to generate via the dictionary?
>
> The Name Finder can be extended with custom feature generators, in case
> you have some ideas or just want to experiment a bit.
>
>> As far as I understood, the name finder is at the moment only stable
>> for one property, like person names. I would like to have the
>> traditional division into persons, locations, organizations and misc.
>> When manually creating the training data, would it be OK to add all
>> four kinds already to the text and then maybe create 4 models later
>> for the different properties?
>
> The name finder trainer by default trains a model for all name types
> occurring in the training data; the -nameTypes option can reduce the
> used types to one or several. I often use this, it works great.
>
>> The name finder uses as input sentences and tokens. Would it be OK to
>> also have POS tags assigned to the training data? That would make it
>> much easier to manually annotate the data when e.g. NEs are already
>> marked by the POS tagger.
>>
>
> Passing in POS tags is currently not supported by our API. The easiest
> way to get around that limitation is probably to run the POS tagger
> as part of the name finder feature generation.
>
> There is German CoNLL training data you could use to train a name
> finder model:
> http://www.cnts.ua.ac.be/conll2003/ner/
>
> The OpenNLP Name Finder can be directly trained on the CoNLL 2003 data.
>
> HTH,
> Jörn
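
P.S. To illustrate the point above about annotating all four name types in one training text: here is a minimal sketch, in Python, that writes one tokenized sentence in the `<START:type> ... <END>` markup described in the OpenNLP manual (one sentence per line, tokens separated by spaces). The function name `to_opennlp_line`, the span representation, and the example sentence are assumptions made up for illustration; they are not part of OpenNLP or of this thread.

```python
from typing import List, Tuple

# Hypothetical span representation: (first token index, end index exclusive, name type)
Span = Tuple[int, int, str]


def to_opennlp_line(tokens: List[str], spans: List[Span]) -> str:
    """Render one tokenized sentence as a name finder training line.

    Assumes the spans do not overlap and are not nested.
    """
    starts = {start: name_type for start, _, name_type in spans}
    ends = {end for _, end, _ in spans}
    out = []
    for i, token in enumerate(tokens):
        if i in starts:
            out.append(f"<START:{starts[i]}>")  # open a typed name span
        out.append(token)
        if i + 1 in ends:
            out.append("<END>")  # close the span after its last token
    return " ".join(out)


# Illustrative sentence with three name types annotated at once
tokens = ["Angela", "Merkel", "besucht", "Siemens", "in", "München", "."]
spans = [(0, 2, "person"), (3, 4, "organization"), (5, 6, "location")]
line = to_opennlp_line(tokens, spans)
print(line)
```

With data written like this, all types live in one file, and the -nameTypes option mentioned above could presumably be used to restrict training to a subset of them.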
