On 10/10/13 10:58, Thomas Zastrow wrote:
Somewhere in the documentation, I read about a dictionary-driven NE
recognizer in OpenNLP, but I didn't find any further information
about it. Anyway, would it be possible to combine the statistical
approach with dictionaries? For example, having a list of country
names would be useful.
Well, the short answer is no... OpenNLP doesn't let you create an
aggregate name-finder which combines the predictions from several other
name-finders (regex, dictionary, maxent, etc.).
Now, that said, what you're asking is perfectly reasonable and, in fact,
I'm maintaining a personal build of OpenNLP (1.5.3) which is hacked to
allow exactly what you propose, for my own research. The two major
additions/modifications I've incorporated are a single class called
'AggregateNameFinder', plus a slightly modified DictionaryNameFinder
that accepts an arbitrary number of dictionaries instead of requiring a
brand new name-finder per physical dictionary.
After all that, I am now in a position where I can create an
AggregateNameFinder which accepts any number of name-finders during
construction. It acts as a regular name-finder, but when the time comes
to predict, it consults all the internal name-finders and 'merges' their
predictions.
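To be clear, this class is not in the official distribution, so the sketch below is only a guess at its shape, modelled on OpenNLP's TokenNameFinder idea (a find() method over a token array). The Finder and NSpan types are stand-ins I made up so it compiles without the OpenNLP jar:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stand-ins for opennlp.tools.util.Span / TokenNameFinder so the
// sketch compiles without OpenNLP on the classpath.
class NSpan {
    final int start, end;   // token offsets, end exclusive
    final String type;      // "person", "location", ...
    NSpan(int start, int end, String type) {
        this.start = start; this.end = end; this.type = type;
    }
}

interface Finder { NSpan[] find(String[] tokens); }

// Accepts any number of finders at construction; acts like a regular
// finder, but find() consults every internal finder and pools the
// results. (Conflicting, overlapping spans still need a resolution
// policy on top of this.)
class AggregateNameFinder implements Finder {
    private final Finder[] finders;
    AggregateNameFinder(Finder... finders) { this.finders = finders; }

    @Override public NSpan[] find(String[] tokens) {
        List<NSpan> all = new ArrayList<>();
        for (Finder f : finders)
            for (NSpan s : f.find(tokens)) all.add(s);
        all.sort(Comparator.comparingInt(s -> s.start));
        return all.toArray(new NSpan[0]);
    }
}
```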
Now, there are a couple of things to think about when doing this...
First of all, what happens when some model(s) give a positive prediction
but other model(s) give a negative one? There is generally no way to
decide upfront what to keep and what to throw out, but depending on your
use-case you could special-case certain models. For instance, if you've
got a good-quality dictionary, chances are you can trust it: whenever
the dictionary says "YES", you accept it blindly.
Basically, here is the rule that my code follows:
* if more than one model says 'yes' on the same token, keep the one
with the greater confidence (obviously!)
But then you have to keep in mind that some models do not predict based
on probabilities (deterministic models -> dictionary, regex), and
therefore you have to create artificial confidences for the rule to
work. So, if you've got a good dictionary, you can give 100% confidence
to all predictions originating from it. You can do the same with any
regex models you might have, as long as you can be sure that your regex
patterns are indeed 100% precise, because you're essentially hardcoding
the confidence (there are ways of confirming that in certain cases).
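The rule plus the artificial-confidence trick can be sketched in a few lines of plain Java. This is not OpenNLP's API, just the merge logic; the Pred class, the type labels and the probabilities are all made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.ListIterator;

// Toy prediction: (start, end, type, prob). A deterministic source
// (dictionary, regex) has no real probability, so we hardcode one --
// here 1.0 for a trusted dictionary, per the rule above.
final class Pred {
    final int start, end; final String type; final double prob;
    Pred(int s, int e, String t, double p) {
        start = s; end = e; type = t; prob = p;
    }
}

class MergeRule {
    /** Among predictions covering the same tokens, keep the most confident. */
    static List<Pred> merge(List<Pred> preds) {
        List<Pred> kept = new ArrayList<>();
        for (Pred cand : preds) {
            boolean add = true;
            for (ListIterator<Pred> it = kept.listIterator(); it.hasNext(); ) {
                Pred k = it.next();
                boolean overlap = cand.start < k.end && k.start < cand.end;
                if (overlap) {
                    if (cand.prob > k.prob) it.set(cand); // higher confidence wins
                    add = false;
                    break;
                }
            }
            if (add) kept.add(cand);
        }
        return kept;
    }

    public static void main(String[] args) {
        // The maxent model tags tokens 0-2 as an organization with p=0.62;
        // the dictionary matched the same tokens as a location with
        // its artificial p=1.0, so the dictionary wins.
        List<Pred> merged = merge(Arrays.asList(
                new Pred(0, 2, "organization", 0.62),
                new Pred(0, 2, "location", 1.0)));
        System.out.println(merged.get(0).type + " " + merged.get(0).prob);
        // prints "location 1.0"
    }
}
```

Note the simplification: a candidate is checked only against the first overlapping span it meets; a production version would have to handle chains of partially overlapping spans.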
As far as I understood, the name finder is at the moment only stable
for one property, like person names. I would like to have the
traditional division into persons, locations, organizations and misc.
When manually creating the training data, would it be OK to annotate
all four kinds in the text and then maybe create 4 models later for
the different properties?
There is no reason to create 4 different models. Just put all kinds of
NEs in the training set and the resulting model will be able to
recognise all of them (assuming you've got enough data, of course).
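For reference, OpenNLP's name-finder training format supports typed tags, so a single file (one whitespace-tokenized sentence per line) can carry all four classes at once. The sentences below are made-up examples of the format:

```
<START:person> Pierre Vinken <END> will join the board of <START:organization> Elsevier N.V. <END> , based in <START:location> Amsterdam <END> .
The <START:misc> Nobel Prize <END> was awarded to <START:person> Marie Curie <END> twice .
```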
The name finder takes sentences and tokens as input. Would it be OK to
also have POS tags assigned to the training data? That would make it
much easier to manually annotate the data when e.g. NEs are already
marked by the POS tagger.
If it matters to you, you can specify a feature that looks at the
POS tag of each token. Even though the POS tag can indeed be a very
informative feature, I would suggest trying without it first, as it
involves running the POS tagger, which is computationally quite
expensive. If your experiments show that the name-finder recognises
tokens of the wrong POS tag, then you can start thinking about that
extra feature. I think that for persons, locations & organisations
you won't really need it.
hope that helps, :)
Jim