On 10/02/2013 02:06 AM, Mark G wrote:
I've been using OpenNLP for a few years, and I find the best results occur
when the models are generated from samples of the data they will be run
against; that is one of the reasons I like the maxent approach. I am not
sure that attempting to provide pre-built models will bear much fruit,
other than that users will no longer be afraid of the licensing issues
associated with using them in commercial systems. I do strongly think we
should provide a model-building framework (that calls the training API)
and a default impl.
Coincidentally, I have been building such a framework and impl over the
last few months. It creates models by seeding an iterative process with
known entities: it iterates through a set of supplied sentences to
recursively create annotations, writes them out, trains a maxent model,
loads that model, and then creates more annotations based on the results
(a validation object is involved), and so on. With this method I was able
to create an NER model for people's names against a 200K-sentence corpus
that returns acceptable results, starting from just a list of five highly
unambiguous names. I will propose the framework in more detail in the
coming days and supply my impl if everyone is interested.
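The seed-and-iterate loop described above can be sketched in miniature. This is an illustrative toy, not the actual framework or the OpenNLP training API: the real impl would train a maxent model each round via the training API, whereas this sketch stands in for the model with simple left-context matching, and the capitalization check stands in for the validation object.

```python
# Toy sketch of seed-based bootstrapping of NER training data.
# A real pipeline would train an OpenNLP maxent model each round;
# here a learned set of left-context tokens plays the model's role.

def bootstrap(sentences, seeds, rounds=3):
    """Grow a set of known names by alternating between annotating
    sentences with the current name list and harvesting the contexts
    those names appear in."""
    known = set(seeds)
    contexts = set()
    for _ in range(rounds):
        # 1. "Annotate": record the token immediately left of each
        #    known name (stand-in for writing annotations + training).
        for sent in sentences:
            toks = sent.split()
            for i, tok in enumerate(toks):
                if tok in known and i > 0:
                    contexts.add(toks[i - 1])
        # 2. "Apply the model": a capitalized token that follows a
        #    learned context passes validation and becomes a new name.
        for sent in sentences:
            toks = sent.split()
            for i, tok in enumerate(toks[:-1]):
                if tok in contexts and toks[i + 1][:1].isupper():
                    known.add(toks[i + 1])
    return known

corpus = [
    "Dr. Alice met Bob .",
    "Dr. Carol spoke first .",
    "Later Dave arrived .",
]
print(sorted(bootstrap(corpus, {"Alice"})))  # → ['Alice', 'Carol']
```

Starting from the single seed "Alice", the loop learns the context "Dr." and uses it to pull in "Carol", while "Bob" and "Dave" are (correctly or not) left out because their contexts were never validated, which is exactly the kind of decision the real framework's validation object has to make.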
As for the initial question, I would like to see OpenNLP provide a
framework for rapidly and semi-automatically building models out of user
data, and also for performing entity resolution across documents,
assigning a probability that the "Bob" in one document is the same as the
"Bob" in another.
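One common way to score that cross-document question is to compare the contexts the two mentions occur in. The sketch below is a hypothetical illustration, not an OpenNLP API: it scores two same-surface-form mentions by the Jaccard overlap of the words co-occurring with each, where a higher score suggests the same underlying entity.

```python
# Hedged sketch of cross-document entity resolution: how likely is it
# that "Bob" in doc A and "Bob" in doc B are the same person?
# (Illustrative only; function names are invented for this example.)

def context_profile(doc, name):
    """Bag of words from sentences in the document that mention the name."""
    words = set()
    for sent in doc:
        toks = sent.lower().split()
        if name.lower() in toks:
            words.update(t for t in toks if t != name.lower())
    return words

def same_entity_score(doc_a, doc_b, name):
    """Jaccard overlap of the two context profiles, in [0, 1]."""
    a = context_profile(doc_a, name)
    b = context_profile(doc_b, name)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

doc_a = ["bob plays guitar in the band"]
doc_b = ["bob joined the band on guitar"]
doc_c = ["bob repairs engines daily"]
print(same_entity_score(doc_a, doc_b, "Bob"))  # shared context: higher
print(same_entity_score(doc_a, doc_c, "Bob"))  # disjoint context: 0.0
```

A production resolver would of course use richer features (titles, co-mentioned entities, dates) and a calibrated model rather than raw set overlap, but the shape of the problem is the same: turn each mention into a profile and score profile similarity as a probability.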


Sounds very interesting. The sentence-wise training data produced this way
could also be combined with existing training data, or just be used to
bootstrap a model so that a document-level annotation tool can label data
more efficiently.

Another aspect is that this tool might be good at detecting mistakes in existing training data.

Jörn

