Jorn,
Thank you for your reply. Here is what I tried as a simple test:
- tagged the names in about 20 resumes using the "<START:person><END>"
notation
- concatenated them into a single text file.
- created a new .bin file using the following command:
>opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8
- using this model file and TokenNameFinderModel, tried to identify a name
in one of the resumes I used for training. (I can post the code if you
need it.)
Should this work? If not, what am I doing wrong?
Thanks,
Sanjeev.
-----Original Message-----
From: Jörn Kottmann [mailto:[email protected]]
Sent: Friday, March 28, 2014 5:04 AM
To: [email protected]
Subject: Re: Training new models
On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
> Hi,
>
>
>
> I am new to OpenNLP. I've been playing with chunking, tokenizing, POS
> tagging, and Name recognition for a few days. I've been following the
> example code and using preexisting models from
> http://opennlp.sourceforge.net/models-1.5/. I've been having some
> trouble with name recognition and organization recognition in that
> using the above mentioned models I can only identify common names or
> organizations like "Mike Smith" and "IBM". In addition I need to be
> able to find date ranges and technical language like "Java", "C++",
> and "HTML" (I should mention that my input is going to be resumes).
>
>
>
> I figured I need to train my own models, especially since my training
> data should look more like my input to give a better context (i.e.
> resumes).
> I've been trying to find some information on how to do this in the
> documentation and also doing google searches. I found a few simple
> examples, but not much more. I did see the example in the
> documentation with the "<START:person> <END>" tags and the command
> line to process the training data into a .bin file, but nothing with
> organization names. I tried to look at one or two of the annotation
> guides and that created more questions than answers (for example, the
> annotation guides are not consistent with each other or the example in the
> documentation. Are there pros and cons between the different formats?
> Are the examples in the documentation in a native format? Is there a
> conversion utility? If so and I'm creating data from scratch, would
> it not be better to just put it in the native
> format?)
>
>
>
> I just lack understanding of OpenNLP and NLP in general and the
> OpenNLP Manual just hasn't worked for me. Maybe I'm just
> misinterpreting the documentation or just not looking in the right
> place. I would appreciate it greatly if someone could point me in the
> right direction in the way of real world examples of training a model,
> recommending a book I can read through, or maybe just some good
> examples of training data. Beyond the specific task I'm trying to
> accomplish, I would like to get a deeper understanding of how OpenNLP
> works.
Hello,
The OpenNLP Name Finder training format is rather simple. As you already
figured out, you use the <START:entity_name> and <END> tags to
mark the names in tokenized plain-text documents.
In the example above you could replace <START:person> with
<START:organization> to mark up an organization name in your text.
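As a minimal sketch of what such training lines look like, here is a small Java helper that renders one tokenized sentence with its labeled names. The class and method names (NameTrainingLine, Entity, annotate) are made up for illustration and are not part of the OpenNLP API. The sketch assumes the sentence is already tokenized, that each sentence goes on its own line, and that tags are separated from the neighboring tokens by single spaces. Mixing two entity types in one sentence is also just an assumption for the example.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not OpenNLP API): renders one tokenized sentence
 * as a line in the <START:type> ... <END> training notation.
 */
final class NameTrainingLine {

    /** A labeled token range; start is inclusive, end is exclusive. */
    static final class Entity {
        final int start;
        final int end;
        final String type;

        Entity(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Joins tokens with single spaces and wraps each entity in tags. */
    static String annotate(String[] tokens, List<Entity> entities) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            // Open a tag before the first token of an entity.
            for (Entity e : entities) {
                if (e.start == i) {
                    out.add("<START:" + e.type + ">");
                }
            }
            out.add(tokens[i]);
            // Close the tag after the last token of an entity.
            for (Entity e : entities) {
                if (e.end == i + 1) {
                    out.add("<END>");
                }
            }
        }
        return String.join(" ", out);
    }

    public static void main(String[] args) {
        // Example sentence, already split into tokens (assumed input).
        String[] tokens = {"Mike", "Smith", "worked", "at", "IBM", "."};
        List<Entity> entities = new ArrayList<>();
        entities.add(new Entity(0, 2, "person"));
        entities.add(new Entity(4, 5, "organization"));
        System.out.println(annotate(tokens, entities));
    }
}
```

For the tokens above this prints:
<START:person> Mike Smith <END> worked at <START:organization> IBM <END> .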
To create a model which performs well on your documents, you will have to
label quite a few of them, and using a text editor to insert the tags is an
approach which does not scale beyond a few documents.
I suggest having a look at brat:
http://brat.nlplab.org/
Brat has a few issues in the 1.3 release, but they are now resolved in
the trunk, so I recommend using the trunk instead of 1.3.
The OpenNLP Name Finder in the trunk version can be directly trained on
the brat format.
If you want to use OpenNLP 1.5.3 instead, you can still use 1.6.0 to
convert the data into the OpenNLP format discussed above.
I know a few people who have done this successfully. Let us know if you
have any issues. A contribution about this process to our documentation
would be very welcome!
HTH,
Jörn