On 10/10/13 10:58, Thomas Zastrow wrote:
Somewhere in the documentation, I read about a dictionary-driven NE
recognizer in OpenNLP, but I didn't find any further information
about it. Anyway, would it be possible to combine the statistical
approach with dictionaries? For example, having a list of country
names would be useful.
Well, the short answer is no... OpenNLP doesn't let you create an
aggregate name-finder which combines the predictions from several other
name-finders (regex, dictionary, maxent, etc.).
Now, that said, what you're asking is perfectly reasonable and, in fact,
I'm maintaining a personal build of OpenNLP (1.5.3) which is hacked to
allow exactly what you propose, for my own research. The two major
additions/modifications I've incorporated are a single class called
'AggregateNameFinder', plus a slightly modified DictionaryNameFinder
that accepts an arbitrary number of dictionaries instead of requiring a
brand new name-finder per physical dictionary.
After all that, I am now in a position where I can create an
AggregateNameFinder which accepts any number of name-finders during
construction. It acts as a regular name-finder, but when the time comes
to predict, it consults all the internal name-finders and 'merges' their
predictions.
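To be clear, this class is not in the official distribution, so the sketch below is only a guess at its shape, modelled on OpenNLP's TokenNameFinder idea (a find() method over a token array). The Finder and NSpan types are stand-ins I made up so it compiles without the OpenNLP jar:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stand-ins for opennlp.tools.util.Span / TokenNameFinder so the
// sketch compiles without OpenNLP on the classpath.
class NSpan {
    final int start, end;   // token offsets, end exclusive
    final String type;      // "person", "location", ...
    NSpan(int start, int end, String type) {
        this.start = start; this.end = end; this.type = type;
    }
}

interface Finder { NSpan[] find(String[] tokens); }

// Accepts any number of finders at construction; acts like a regular
// finder, but find() consults every internal finder and pools the
// results. (Conflicting, overlapping spans still need a resolution
// policy on top of this.)
class AggregateNameFinder implements Finder {
    private final Finder[] finders;
    AggregateNameFinder(Finder... finders) { this.finders = finders; }

    @Override public NSpan[] find(String[] tokens) {
        List<NSpan> all = new ArrayList<>();
        for (Finder f : finders)
            for (NSpan s : f.find(tokens)) all.add(s);
        all.sort(Comparator.comparingInt(s -> s.start));
        return all.toArray(new NSpan[0]);
    }
}
```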
Now, there are a couple of things to think about when doing this...
First of all, what happens when some model(s) give a positive prediction
but other model(s) give a negative one? There is generally no way to
decide upfront what to keep and what to throw out, but depending on your
use-case you could special-case certain models. For instance, if you've
got a good-quality dictionary, chances are you can trust it: whenever
the dictionary says "YES", you accept it blindly.
Basically, here is the rule that my code follows:
* if more than one model says 'yes' on the same token, keep the one
with the greater confidence (obviously!)
But then you have to keep in mind that some models do not predict based
on probabilities (deterministic models -> dictionary, regex), and
therefore you have to create artificial confidences for the rule to
work. So, if you've got a good dictionary, you can give 100% confidence
to all predictions originating from it. You can do the same with any
regex models you might have, as long as you can be sure that your regex
patterns are indeed 100% precise, because you're essentially hardcoding
the confidence (there are ways of confirming that in certain cases).
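The rule plus the artificial-confidence trick can be sketched in a few lines of plain Java. This is not OpenNLP's API, just the merge logic; the Pred class, the type labels and the probabilities are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

// Toy prediction: (start, end, type, prob). A deterministic source
// (dictionary, regex) has no real probability, so we hardcode one --
// here 1.0 for a trusted dictionary, per the rule above.
final class Pred {
    final int start, end; final String type; final double prob;
    Pred(int s, int e, String t, double p) {
        start = s; end = e; type = t; prob = p;
    }
}

class MergeRule {
    /** Among predictions covering the same tokens, keep the most confident. */
    static List<Pred> merge(List<Pred> preds) {
        List<Pred> kept = new ArrayList<>();
        for (Pred cand : preds) {
            boolean add = true;
            for (ListIterator<Pred> it = kept.listIterator(); it.hasNext(); ) {
                Pred k = it.next();
                boolean overlap = cand.start < k.end && k.start < cand.end;
                if (overlap) {
                    if (cand.prob > k.prob) it.set(cand); // higher confidence wins
                    add = false;
                    break;
                }
            }
            if (add) kept.add(cand);
        }
        return kept;
    }

    public static void main(String[] args) {
        // The maxent model tags tokens 0-2 as an organization with p=0.62;
        // the dictionary matched the same tokens as a location with
        // its artificial p=1.0, so the dictionary wins.
        List<Pred> merged = merge(Arrays.asList(
                new Pred(0, 2, "organization", 0.62),
                new Pred(0, 2, "location", 1.0)));
        System.out.println(merged.get(0).type + " " + merged.get(0).prob);
        // prints "location 1.0"
    }
}
```

Note the simplification: a candidate is checked only against the first overlapping span it meets; a production version would have to handle chains of partially overlapping spans.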
As far as I understood, the name finder is at the moment only stable
for one property, like person names. I would like to have the
traditional division into persons, locations, organizations and misc.
When manually creating the training data, would it be OK to annotate
all four kinds in the text and then maybe create 4 models later for
the different properties?
There is no reason to create 4 different models. Just put all kinds of
NEs in the training set and the resulting model will be able to
recognise all of them (assuming you've got enough data, of course).
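For reference, OpenNLP's name-finder training format supports typed tags, so a single file (one whitespace-tokenized sentence per line) can carry all four classes at once. The sentences below are made-up examples of the format:

```
<START:person> Pierre Vinken <END> will join the board of <START:organization> Elsevier N.V. <END> , based in <START:location> Amsterdam <END> .
The <START:misc> Nobel Prize <END> was awarded to <START:person> Marie Curie <END> twice .
```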
The name finder takes sentences and tokens as input. Would it be OK to
also have POS tags assigned to the training data? That would make it
much easier to manually annotate the data when e.g. NEs are already
marked by the POS tagger.
If it matters to you, you can specify a feature that looks at the
POS tag of each token. Even though the POS tag can indeed be a very
informative feature, I would suggest trying without it first, as it
involves running the POS tagger, which is computationally quite
expensive. If your experiments show that the name-finder recognises
tokens of the wrong POS tag, then you can start thinking about that
extra feature. I think that for persons, locations & organisations
you won't really need it.
hope that helps, :)
Jim