You should use a few hundred documents, maybe up to a bit over a thousand,
to get good performance.

The model training command looks good. To get anything detected you will
need more data. I would also use the perceptron with a cutoff of zero
instead of the default maxent with a cutoff of five.
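
A minimal sketch of that setup, assuming the trainer is given a training
parameters file via -params. The helper names, the file name
perceptron.params and the iteration count of 100 are assumptions for
illustration and not taken from this thread:

```python
def perceptron_params(iterations=100, cutoff=0):
    """Return the text of an OpenNLP training parameters file.

    The file uses Java properties syntax (key=value, one per line).
    iterations=100 is an assumed starting value, not a recommendation.
    """
    lines = [
        "Algorithm=PERCEPTRON",      # instead of the default maxent
        f"Iterations={iterations}",
        f"Cutoff={cutoff}",          # 0 keeps features seen only once
    ]
    return "\n".join(lines) + "\n"


def trainer_command(params_file, model="persons.bin", data="train.txt"):
    """Build the trainer command line from the thread, plus -params."""
    return [
        "opennlp", "TokenNameFinderTrainer",
        "-params", params_file,
        "-model", model,
        "-lang", "en",
        "-data", data,
        "-encoding", "UTF-8",
    ]


if __name__ == "__main__":
    # Save the first output as perceptron.params (hypothetical name),
    # then run the printed command.
    print(perceptron_params(), end="")
    print(" ".join(trainer_command("perceptron.params")))
```

A cutoff of zero matters with small training sets because a name that
occurs only once would otherwise contribute no features at all.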

HTH,
Jörn


On Sun, Mar 30, 2014 at 7:01 AM, Stuart Robinson <[email protected]> wrote:

> Thanks, Sanjeev. I was actually asking about the data used to train the
> tokenizers provided by OpenNLP. I'll start a new thread to prevent
> confusion. Sorry about that.
>
>
> On Sat, Mar 29, 2014 at 7:23 PM, Sanjeev Sharma <[email protected]> wrote:
>
> > Sorry, can't share the data due to privacy concerns.  The way I got this
> > data was to extract text from Word doc resumes, cat them into a single
> > text file, and tag only the names using <START:person> and <END> tags.  I'm
> > using 20 or so resumes for initial experimentation, but the actual
> > training data will have several hundred resumes.
> >
> > -----Original Message-----
> > From: Stuart Robinson [mailto:[email protected]]
> > Sent: Saturday, March 29, 2014 8:01 PM
> > To: [email protected]
> > Subject: Re: Training new models
> >
> > Is the training data used to train the tokenizer models available?
> > Specifically, I'm interested in the data used to train the English
> > tokenizer:
> >
> > http://opennlp.sourceforge.net/models-1.5/en-token.bin
> >
> > Thanks,
> > Stuart Robinson
> >
> > > On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma
> > > <[email protected]> wrote:
> > >
> > > Jorn,
> > >
> > > > Thank you for your reply.  Here is what I tried as a simple test:
> > >
> > > - tagged the names on about 20 resumes using "<START:person><END>"
> > > notation
> > > - concatenated them into a single text file.
> > > - created a new .bin file using the following command
> > >
> > > >    >opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8
> > > > - using this model file and TokenNameFinderModel tried to identify a
> > > > name in one of the resumes I used for training.  (I can post the code
> > > > if you need.)
> > >
> > > Should this work?  If not, what am I doing wrong?
> > >
> > > Thanks,
> > > Sanjeev.
> > >
> > > -----Original Message-----
> > > From: Jörn Kottmann [mailto:[email protected]]
> > > Sent: Friday, March 28, 2014 5:04 AM
> > > To: [email protected]
> > > Subject: Re: Training new models
> > >
> > >> On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
> > >> Hi,
> > >>
> > >>
> > >>
> > >> I am new to OpenNLP.  I've been playing with chunking, tokenizing,
> > >> POS tagging, and Name recognition for a few days.  I've been
> > >> following the example code and using preexisting models from
> > >> http://opennlp.sourceforge.net/models-1.5/.  I've been having some
> > >> trouble with name recognition and organization recognition in that
> > >> using the above mentioned models I can only identify common names or
> > >> organizations like "Mike Smith" and "IBM".  In addition I need to be
> > >> able to find date ranges and technical language like "Java", "C++",
> > >> and "HTML" (I should mention that my input is going to be resumes).
> > >>
> > >>
> > >>
> > >> I figured I need to train my own models, especially since my training
> > > >> data should look more like my input to give a better context (i.e. resumes).
> > >> I've been trying to find some information on how to do this in the
> > >> documentation and also doing google searches.  I found a few simple
> > >> examples, but not much more.  I did see the example in the
> > >> documentation with the "<START:person> <END>" tags and the command
> > >> line to process the training data into a .bin file, but nothing with
> > >> organization names.  I tried to look at one or two of the annotation
> > >> guides and that created more questions than answers (for example, the
> > > >> annotation guides are not consistent with each other or the example in
> > >> the documentation.  Are there pros and cons between the different
> > >> formats?
> > >> Are the examples in the documentation in a native format?  Is there a
> > >> conversion utility?  If so and I'm creating data from scratch, would
> > >> it not be better to just put it in the native
> > >> format?)
> > >>
> > >>
> > >>
> > >> I just lack understanding of OpenNLP and NLP in general and the
> > >> OpenNLP Manual just hasn't worked for me.  Maybe I'm just
> > >> misinterpreting the documentation or just not looking in the right
> > >> place.  I would appreciate it greatly if someone could point me in
> > >> the right direction in the way of real world examples of training a
> > >> model, recommending a book I can read through, or maybe just some
> > >> good examples of training data.  Beyond the specific task I'm trying
> > > >> to accomplish, I would like to get a deeper understanding of how
> > > >> OpenNLP works.
> > >
> > > Hello,
> > >
> > > > the OpenNLP Name Finder training format is rather simple. As you
> > > > already figured out, you need to use the <START:entity_name> and <END>
> > > > tags to mark the names in tokenized plain text documents.
> > >
> > > > In the example above you could replace <START:person> with
> > > > <START:organization> to mark up an organization name in your text.
> > >
> > > To create a model which performs on your documents you will have to
> > > label quite a few of them and using a text editor to insert the tags
> > > is an approach which does not scale for more than a few documents.
> > >
> > > > I suggest having a look at brat:
> > > http://brat.nlplab.org/
> > >
> > > > Brat has a few issues in the 1.3 release version, but they are now
> > > > resolved in the trunk, so I recommend using the trunk instead of 1.3.
> > >
> > > The OpenNLP Name Finder in the trunk version can be directly trained
> > > on the brat format.
> > > If you want to use OpenNLP 1.5.3 instead you can still use 1.6.0 to
> > > convert the data into the above discussed OpenNLP format.
> > >
> > > > I know a few people who have done this successfully. Let us know if
> > > > you have any issues, and a contribution about this process to our
> > > > documentation would be very welcome!
> > >
> > > HTH,
> > > Jörn
> >
>
