Dear Jim,

Thank you very much. I'm aware of the problems that arise when different kinds of name finder algorithms are "mixed". I'm not sure, but I would say that when the training data is not really big enough, a small but high-quality dictionary should always be trusted over the statistical analysis. But maybe I'm wrong, I have to try it :-)

Btw., is there documentation somewhere about the dictionary format and how to use dictionaries from Java code? I didn't find anything about dictionaries in the official docs.

Best regards,

Tom

On 10.10.2013 12:54, Jim wrote:
> On 10/10/13 10:58, Thomas Zastrow wrote:
>> Somewhere in the documentation, I read about a dictionary driven NE
>> recognizer in OpenNLP. But I didn't find any further information
>> about it. Anyway, would it be possible to combine the statistical
>> approach with dictionaries? For example, having a list of country
>> names would be useful.
>
>
> well, the short answer is no...openNLP doesn't allow you to create an
> aggregate name-finder which combines the predictions from several other
> name-finders (regex, dictionary, maxent etc etc).
>
> Now, that said, what you're asking is perfectly reasonable and in fact,
> I'm maintaining a personal build of openNLP (1.5.3) which is sort of
> hacked in order to allow exactly what you propose, for my own research.
> The 2 major additions/modifications I've incorporated are a single class
> called 'AggregateNameFinder' and a slightly modified DictionaryNameFinder
> that allows an arbitrary number of actual dictionaries to be used
> instead of creating a brand new name-finder per physical dictionary.
>
> After having done all that I am now in a position where I can create
> an AggregateNameFinder which accepts any number of name-finders during
> construction. It acts as a regular name-finder but when the time comes
> to predict, it consults all the internal name-finders and 'merges' their
> predictions.
>
> Now, there are a couple of things to think about when doing this...First
> of all, what happens when some model(s) give a positive prediction but
> some other model(s) give a negative prediction? There is generally no
> way to decide upfront what to keep and what to throw out but depending
> on your use-case you could special-case certain models. For instance if
> you've got a good quality dictionary chances are you can trust it. That
> means that whenever the dictionary says "YES", you accept it blindly.
> Basically here is the rule that my code follows:
>
> * if more than one model says 'yes' on the same token, keep the one
> with the highest confidence (obviously!)
>
> but then you have to keep in mind that some models do not predict based
> on probabilities (deterministic models -> dictionary, regex) and
> therefore you have to create artificial ones in order for the rule to
> work. So, if you've got a good dictionary you can give 100% confidence
> to all predictions originating from the dictionary. You can do the same
> with any regex models that you might have, as long as you can be sure
> that your regex patterns are indeed 100% precise, cos you're essentially
> hardcoding the confidence (there are ways of confirming that in certain
> cases).
>
>
>> As far as I understood, the name finder is at the moment only stable
>> for one property, like person names. I would like to have the
>> traditional division into persons, locations, organizations and misc.
>> When manually creating the training data, would it be OK to add all
>> four kinds already to the text and then, maybe create later 4 models
>> for the different properties?
>
>
> There is no reason to create 4 different models. Just put all kinds of
> NEs in the training set and the resulting model will be able to
> recognise all of them (assuming you've got enough data of course).
>
>> The name finder uses as input sentences and tokens. Would it be OK to
>> also have POS tags assigned to the training data? That would make it
>> much easier to manually annotate the data when e.g. NEs are already
>> marked by the POS tagger.
>
> If it matters to you, you can specify a feature that looks at the
> POS-tag of each token. Even though the pos-tag can indeed be a very
> informative feature, I would suggest trying without that feature first
> as it involves running the POS-tagger which is quite computationally
> expensive. If your experiment shows that the name-finder recognises
> tokens that are of the wrong pos-tag, then you can start thinking about
> that extra feature. I think that for persons, locations & organisations
> you won't really need it.
>
> hope that helps, :)
>
> Jim
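
For reference, the merge rule described in the quoted message (several finders answer, the highest confidence wins, and deterministic finders get an artificial confidence) could be sketched in plain Java roughly as follows. This is not the AggregateNameFinder from Jim's personal build and it does not use the OpenNLP API: the class AggregateMergeSketch, the Prediction type and the methods deterministic() and merge() are hypothetical names made up for illustration. Two further assumptions: "the same token" is read here as "overlapping token ranges", and dictionary or regex hits are given a fixed confidence of 1.0.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: all names here are hypothetical, none of this is OpenNLP API.
final class AggregateMergeSketch {

    /** One predicted entity: token range [start, end), type, confidence in [0, 1]. */
    static final class Prediction {
        final int start;
        final int end;
        final String type;
        final double confidence;

        Prediction(int start, int end, String type, double confidence) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.confidence = confidence;
        }

        /** True if the two token ranges share at least one token. */
        boolean overlaps(Prediction other) {
            return start < other.end && other.start < end;
        }

        @Override
        public String toString() {
            return type + "[" + start + "," + end + ") conf=" + confidence;
        }
    }

    /** Dictionary and regex finders have no probabilities, so assign an artificial 1.0. */
    static Prediction deterministic(int start, int end, String type) {
        return new Prediction(start, end, type, 1.0);
    }

    /** Merges the predictions of several finders; on overlap the higher confidence wins. */
    static List<Prediction> merge(List<List<Prediction>> perFinder) {
        List<Prediction> all = new ArrayList<>();
        for (List<Prediction> predictions : perFinder) {
            all.addAll(predictions);
        }
        // Highest confidence first. The sort is stable, so ties favour earlier finders.
        all.sort(Comparator.comparingDouble((Prediction p) -> p.confidence).reversed());

        List<Prediction> kept = new ArrayList<>();
        for (Prediction candidate : all) {
            boolean clashes = false;
            for (Prediction winner : kept) {
                if (candidate.overlaps(winner)) {
                    clashes = true; // a more confident prediction already covers these tokens
                    break;
                }
            }
            if (!clashes) {
                kept.add(candidate);
            }
        }
        // Return the survivors in sentence order.
        kept.sort(Comparator.comparingInt((Prediction p) -> p.start));
        return kept;
    }

    public static void main(String[] args) {
        // Tokens: Angela(0) Merkel(1) visited(2) South(3) Africa(4) yesterday(5)
        List<Prediction> statistical = Arrays.asList(
                new Prediction(0, 2, "person", 0.91),
                new Prediction(4, 5, "location", 0.62));
        List<Prediction> dictionary = Arrays.asList(deterministic(3, 5, "location"));
        System.out.println(merge(Arrays.asList(statistical, dictionary)));
    }
}
```

In the small example in main(), the dictionary hit for "South Africa" overlaps the statistical hit for "Africa" and replaces it, while the person prediction is untouched because nothing competes with it. Whether a hard 1.0 is appropriate depends entirely on how precise the dictionary really is; a lower artificial value would let a very confident statistical prediction override it.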
