Re: NE Training + Dictionary?

Thomas Zastrow Thu, 10 Oct 2013 11:39:45 -0700

Dear Jim,

Thank you very much. I'm aware of the problems that arise when different
kinds of name finder algorithms are "mixed". I'm not sure, but I would say
that when the training data is not really big enough, a small but
high-quality dictionary should always be trusted over the statistical
analysis. But maybe I'm wrong; I'll have to try it :-)


Btw., is there documentation somewhere about the dictionary format and
how to use dictionaries from Java code? I didn't find anything about
dictionaries in the official docs.

Best regards,

Tom


On 10.10.2013 12:54, Jim wrote:
> On 10/10/13 10:58, Thomas Zastrow wrote:
>> Somewhere in the documentation, I read about a dictionary-driven NE
>> recognizer in OpenNLP, but I didn't find any further information
>> about it. Anyway, would it be possible to combine the statistical
>> approach with dictionaries? For example, having a list of country
>> names would be useful.
> 
> 
> Well, the short answer is no: OpenNLP doesn't let you create an
> aggregate name-finder that combines the predictions from several other
> name-finders (regex, dictionary, maxent, etc.).
> 
> That said, what you're asking is perfectly reasonable and, in fact,
> I'm maintaining a personal build of OpenNLP (1.5.3) which is sort of
> hacked to allow exactly what you propose, for my own research.
> The 2 major changes I've incorporated are a new class called
> 'AggregateNameFinder' and a slightly modified DictionaryNameFinder
> that allows an arbitrary number of actual dictionaries to be used,
> instead of creating a brand new name-finder per physical dictionary.
> 
> Having done all that, I am now in a position where I can create
> an AggregateNameFinder which accepts any number of name-finders during
> construction. It acts as a regular name-finder, but when the time comes
> to predict, it consults all the internal name-finders and 'merges' their
> predictions.
> 
> Now, there are a couple of things to think about when doing this. First
> of all, what happens when some model(s) give a positive prediction but
> some other model(s) give a negative prediction? There is generally no
> way to decide upfront what to keep and what to throw out, but depending
> on your use-case you could special-case certain models. For instance, if
> you've got a good quality dictionary, chances are you can trust it. That
> means that whenever the dictionary says "YES", you accept it blindly.
> Basically, here is the rule that my code follows:
> 
>  * if more than one model says 'yes' on the same token, keep the one
>    with the greater confidence (obviously!)
> 
> But then you have to keep in mind that some models do not predict based
> on probabilities (deterministic models -> dictionary, regex), and
> therefore you have to create artificial ones in order for the rule to
> work. So, if you've got a good dictionary, you can give 100% confidence
> to all predictions originating from the dictionary. You can do the same
> with any regex models that you might have, as long as you can be sure
> that your regex patterns are indeed 100% precise, because you're
> essentially hardcoding the confidence (there are ways of confirming
> that in certain cases).
> 
> 
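
The merge rule quoted above can be sketched in plain Java, without any
OpenNLP dependency. This is only an illustration under stated assumptions,
not the code from Jim's personal build: the class AggregateMergeSketch, the
Prediction type and the helper deterministic() are all hypothetical names,
"on the same token" is interpreted as "the token spans overlap", and ties
are broken by input order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of the merge rule; not part of OpenNLP. */
final class AggregateMergeSketch {

    /** One predicted name: token span [start, end), entity type, confidence in [0, 1]. */
    static final class Prediction {
        final int start;
        final int end;
        final String type;
        final double confidence;

        Prediction(int start, int end, String type, double confidence) {
            this.start = start;
            this.end = end;
            this.type = type;
            this.confidence = confidence;
        }

        /** True if the two spans share at least one token. */
        boolean overlaps(Prediction other) {
            return start < other.end && other.start < end;
        }
    }

    /**
     * Dictionary and regex finders produce no probabilities, so give their
     * hits an artificial confidence of 1.0 (assumes they are trusted fully).
     */
    static Prediction deterministic(int start, int end, String type) {
        return new Prediction(start, end, type, 1.0);
    }

    /** Keeps, for every group of overlapping predictions, the most confident one. */
    static List<Prediction> merge(List<Prediction> all) {
        List<Prediction> byConfidence = new ArrayList<>(all);
        // Highest confidence first; List.sort is stable, so ties keep input order.
        byConfidence.sort(
                Comparator.comparingDouble((Prediction p) -> p.confidence).reversed());

        List<Prediction> kept = new ArrayList<>();
        for (Prediction candidate : byConfidence) {
            boolean clash = false;
            for (Prediction winner : kept) {
                if (candidate.overlaps(winner)) {
                    clash = true; // a more confident prediction already covers these tokens
                    break;
                }
            }
            if (!clash) {
                kept.add(candidate);
            }
        }
        // Return the survivors in sentence order.
        kept.sort(Comparator.comparingInt((Prediction p) -> p.start));
        return kept;
    }
}
```

With this greedy approach a dictionary hit always wins over an overlapping
statistical prediction, which is the "accept it blindly" behaviour described
above. If the dictionary is less reliable, a value below 1.0 could be passed
instead.
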
>> As far as I understand, the name finder is at the moment only stable
>> for one property, like person names. I would like to have the
>> traditional division into persons, locations, organizations and misc.
>> When manually creating the training data, would it be OK to add all
>> four kinds to the text right away and then maybe later create 4 models
>> for the different properties?
> 
> 
> There is no reason to create 4 different models. Just put all kinds of
> NEs in the training set and the resulting model will be able to
> recognise all of them (assuming you've got enough data, of course).
> 
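
Since a single model for all types depends on having enough examples of
each type, it can help to count the annotations per type before training.
The sketch below assumes the usual OpenNLP name finder training format (one
tokenized sentence per line, names wrapped in <START:type> ... <END>). The
class name TrainingDataCountSketch and the method countTypes() are
hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; counts annotated names per entity type. */
final class TrainingDataCountSketch {

    // Matches an opening tag such as <START:person> and captures the type.
    private static final Pattern START_TAG = Pattern.compile("<START:([^>\\s]+)>");

    /** Returns how many annotated names of each type occur in the given sentences. */
    static Map<String, Integer> countTypes(Iterable<String> sentences) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String sentence : sentences) {
            Matcher m = START_TAG.matcher(sentence);
            while (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

A type with only a handful of occurrences is a hint that the combined model
may not learn it well.
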
>> The name finder takes sentences and tokens as input. Would it be OK to
>> also have POS tags assigned to the training data? That would make it
>> much easier to manually annotate the data when e.g. NEs are already
>> marked by the POS tagger.
> 
> If it matters to you, you can specify a feature that looks at the
> POS-tag of each token. Even though the POS-tag can indeed be a very
> informative feature, I would suggest trying without that feature first,
> as it involves running the POS-tagger, which is quite computationally
> expensive. If your experiment shows that the name-finder recognises
> tokens that have the wrong POS-tag, then you can start thinking about
> that extra feature. I think that for persons, locations & organisations
> you won't really need it.
> 
> hope that helps, :)
> 
> Jim
> 
> 
>
