On 10/10/13 10:58, Thomas Zastrow wrote:
Somewhere in the documentation, I read about a dictionary-driven NE recognizer in OpenNLP, but I didn't find any further information about it. Anyway, would it be possible to combine the statistical approach with dictionaries? For example, having a list of country names would be useful.


Well, the short answer is no... OpenNLP doesn't let you create an aggregate name-finder which combines the predictions from several other name-finders (regex, dictionary, maxent, etc.) out of the box.

Now, that said, what you're asking is perfectly reasonable, and in fact I'm maintaining a personal build of OpenNLP (1.5.3) for my own research which is hacked to allow exactly what you propose. The two major additions/modifications I've incorporated are a single new class called 'AggregateNameFinder', and a slightly modified DictionaryNameFinder that accepts an arbitrary number of dictionaries instead of requiring a brand new name-finder per physical dictionary.

After having done all that, I am now in a position where I can create an AggregateNameFinder which accepts any number of name-finders at construction time. It acts as a regular name-finder, but when the time comes to predict, it consults all the internal name-finders and 'merges' their predictions.
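To give you an idea, here is a minimal sketch of what such a class could look like. The class name, the constructor and the merging placeholder are my own invention (this is not stock OpenNLP); only TokenNameFinder and Span are actual API:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.namefind.TokenNameFinder;
import opennlp.tools.util.Span;

// Hypothetical aggregate finder; not part of stock OpenNLP.
public class AggregateNameFinder implements TokenNameFinder {

  private final List<TokenNameFinder> finders;

  public AggregateNameFinder(TokenNameFinder... finders) {
    this.finders = Arrays.asList(finders);
  }

  @Override
  public Span[] find(String[] tokens) {
    List<Span> all = new ArrayList<Span>();
    // Consult every internal name-finder and collect its predictions.
    for (TokenNameFinder finder : finders) {
      all.addAll(Arrays.asList(finder.find(tokens)));
    }
    // Conflict resolution between overlapping spans goes here
    // (see the merging rule discussed below).
    return all.toArray(new Span[all.size()]);
  }

  @Override
  public void clearAdaptiveData() {
    for (TokenNameFinder finder : finders) {
      finder.clearAdaptiveData();
    }
  }
}

You can construct it from a NameFinderME, a DictionaryNameFinder, a RegexNameFinder, or any mix of them, and use it wherever a TokenNameFinder is expected.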

Now, there are a couple of things to think about when doing this. First of all, what happens when some model(s) give a positive prediction but other model(s) give a negative one? There is generally no way to decide upfront what to keep and what to throw out, but depending on your use case you can special-case certain models. For instance, if you've got a good-quality dictionary, chances are you can trust it; that means that whenever the dictionary says "YES", you accept it blindly. Basically, here is the rule that my code follows:

 * if more than one model says 'yes' on the same token, keep the one
   with the greater confidence (obviously!)

But then you have to keep in mind that some models do not predict based on probabilities (deterministic models -> dictionary, regex), and therefore you have to create artificial confidences in order for the rule to work. So, if you've got a good dictionary, you can give 100% confidence to all predictions originating from the dictionary. You can do the same with any regex models that you might have, as long as you can be sure that your regex patterns are indeed 100% precise, because you're essentially hardcoding the confidence (there are ways of confirming that in certain cases).
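For what it's worth, a minimal sketch of that merging rule might look like the following. SpanMerger is a hypothetical helper of mine, and the spans/probs pairing is an assumption: 1.0 for dictionary/regex spans, and for the statistical spans something like the output of NameFinderME's probs() method:

import java.util.ArrayList;
import java.util.List;

import opennlp.tools.util.Span;

// Hypothetical helper illustrating the "keep the more confident span" rule.
// Deterministic finders (dictionary, regex) get an artificial confidence of 1.0.
public class SpanMerger {

  public static Span[] merge(Span[] spans, double[] probs) {
    List<Span> kept = new ArrayList<Span>();
    List<Double> keptProbs = new ArrayList<Double>();

    for (int i = 0; i < spans.length; i++) {
      boolean add = true;
      for (int j = 0; j < kept.size(); j++) {
        if (kept.get(j).intersects(spans[i])) {
          if (probs[i] > keptProbs.get(j)) {
            // The new span overlaps an already kept one but is more
            // confident, so it replaces it.
            kept.set(j, spans[i]);
            keptProbs.set(j, probs[i]);
          }
          add = false;
          break;
        }
      }
      if (add) {
        kept.add(spans[i]);
        keptProbs.add(probs[i]);
      }
    }
    return kept.toArray(new Span[kept.size()]);
  }
}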


As far as I understood, the name finder is at the moment only stable for one property, like person names. I would like to have the traditional division into persons, locations, organizations and misc. When manually creating the training data, would it be OK to add all four kinds to the text already and then maybe later create 4 models for the different properties?


There is no reason to create 4 different models. Just put all kinds of NEs in the training set and the resulting model will be able to recognise all of them (assuming you've got enough data, of course).
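For example, a training file would contain lines like this (one sentence per line, tokens separated by spaces, entities marked with the usual <START:type> ... <END> tags; the sentence itself is just made up for illustration):

<START:person> Pierre Vinken <END> , 61 years old , will join the board of <START:organization> Elsevier N.V. <END> as a director in <START:location> Amsterdam <END> .

The single resulting model will then produce spans typed as person, organization and location, and you can filter on Span.getType() if you only care about one of them.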

The name finder uses sentences and tokens as input. Would it be OK to also have POS tags assigned to the training data? That would make it much easier to manually annotate the data when e.g. NEs are already marked by the POS tagger.

If it matters to you, you can specify a feature that looks at the POS tag of each token. Even though the POS tag can indeed be a very informative feature, I would suggest trying without it first, as it involves running the POS tagger, which is quite computationally expensive. If your experiments show that the name-finder recognises tokens with the wrong POS tag, then you can start thinking about that extra feature. I think that for persons, locations & organisations you won't really need it.
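If you do end up needing it, one possible approach (a sketch only; the class name is mine, and it assumes the AdaptiveFeatureGenerator interface from 1.5.x) is a custom feature generator that consults a POSTaggerME and adds the tag of the current token as a feature:

import java.util.List;

import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator;

// Hypothetical feature generator that adds the POS tag of the current token
// as a feature. It re-tags each new sentence once and caches the result,
// which is exactly the computational cost mentioned above.
public class PosTagFeatureGenerator implements AdaptiveFeatureGenerator {

  private final POSTaggerME tagger;

  private String[] cachedTokens;
  private String[] cachedTags;

  public PosTagFeatureGenerator(POSTaggerME tagger) {
    this.tagger = tagger;
  }

  public void createFeatures(List<String> features, String[] tokens, int index,
      String[] previousOutcomes) {
    if (tokens != cachedTokens) {
      // Tag the whole sentence once and cache the result.
      cachedTokens = tokens;
      cachedTags = tagger.tag(tokens);
    }
    features.add("pos=" + cachedTags[index]);
  }

  public void updateAdaptiveData(String[] tokens, String[] outcomes) {
    // No adaptive data needed for this feature.
  }

  public void clearAdaptiveData() {
    // Nothing to clear.
  }
}

You would then pass an instance of it (together with the default generators) to the name-finder training call, so the same feature is available at both training and prediction time.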

hope that helps, :)

Jim

