You should use a few hundred documents, maybe up to a bit over a
thousand, to get good performance.

The model training command looks good. To get anything detected you
will need more data. And I would use the perceptron with a cutoff of
zero instead of the default maxent with a cutoff of five.
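To switch to the perceptron you can pass a training parameters file to
the trainer via -params. Something like this should do it (the file
name is arbitrary, and the iteration count below is just a starting
point you can tune):

  Algorithm=PERCEPTRON
  Iterations=300
  Cutoff=0

  opennlp TokenNameFinderTrainer -params params.txt -lang en -model persons.bin -data train.txt -encoding UTF-8

And since you offered to post your detection code: the usual 1.5.x
calls look roughly like this (an untested sketch with a made-up sample
sentence; the input has to be tokenized the same way as your training
data):

  import java.io.FileInputStream;
  import java.io.InputStream;

  import opennlp.tools.namefind.NameFinderME;
  import opennlp.tools.namefind.TokenNameFinderModel;
  import opennlp.tools.util.Span;

  public class PersonFinderTest {
      public static void main(String[] args) throws Exception {
          // load the model produced by TokenNameFinderTrainer
          InputStream modelIn = new FileInputStream("persons.bin");
          TokenNameFinderModel model = new TokenNameFinderModel(modelIn);
          modelIn.close();

          NameFinderME nameFinder = new NameFinderME(model);

          // one sentence, tokenized like the training data
          String[] tokens = { "Mike", "Smith", "worked", "at", "IBM", "." };

          // find() returns spans over token indices, end is exclusive
          for (Span name : nameFinder.find(tokens)) {
              StringBuilder sb = new StringBuilder();
              for (int i = name.getStart(); i < name.getEnd(); i++) {
                  sb.append(tokens[i]).append(' ');
              }
              System.out.println(name.getType() + ": " + sb.toString().trim());
          }

          // the name finder keeps adaptive data about names it has
          // already seen, clear it after every document
          nameFinder.clearAdaptiveData();
      }
  }

The clearAdaptiveData() call matters because the default feature
generation uses previously detected names as context within a document.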
HTH,
Jörn

On Sun, Mar 30, 2014 at 7:01 AM, Stuart Robinson <[email protected]> wrote:

> Thanks, Sanjeev. I was actually asking about the data used to train the
> tokenizers provided by OpenNLP. I'll start a new thread to prevent
> confusion. Sorry about that.
>
> On Sat, Mar 29, 2014 at 7:23 PM, Sanjeev Sharma <
> [email protected]> wrote:
>
> > Sorry, can't share the data due to privacy concerns. The way I got this
> > data was to extract text from Word doc resumes, cat them into a single
> > text file, and tag only the names using <START:person> and <END> tags.
> > I'm using 20 or so resumes for initial experimentation, but the actual
> > training data will have several hundred resumes.
> >
> > -----Original Message-----
> > From: Stuart Robinson [mailto:[email protected]]
> > Sent: Saturday, March 29, 2014 8:01 PM
> > To: [email protected]
> > Subject: Re: Training new models
> >
> > Is the training data used to train the tokenizer models available?
> > Specifically, I'm interested in the data used to train the English
> > tokenizer:
> >
> > http://opennlp.sourceforge.net/models-1.5/en-token.bin
> >
> > Thanks,
> > Stuart Robinson
> >
> > > On Mar 29, 2014, at 10:12 AM, Sanjeev Sharma
> > > <[email protected]> wrote:
> > >
> > > Jorn,
> > >
> > > Thank you for your reply. Here is what I tried as a simple test:
> > >
> > > - tagged the names in about 20 resumes using "<START:person><END>"
> > >   notation
> > > - concatenated them into a single text file
> > > - created a new .bin file using the following command:
> > >
> > >   >opennlp TokenNameFinderTrainer -model persons.bin -lang en -data train.txt -encoding UTF-8
> > >
> > > - using this model file and TokenNameFinderModel, tried to identify a
> > >   name in one of the resumes I used for training. (I can post the code
> > >   if you need.)
> > >
> > > Should this work? If not, what am I doing wrong?
> > >
> > > Thanks,
> > > Sanjeev.
> > >
> > > -----Original Message-----
> > > From: Jörn Kottmann [mailto:[email protected]]
> > > Sent: Friday, March 28, 2014 5:04 AM
> > > To: [email protected]
> > > Subject: Re: Training new models
> > >
> > >> On 03/27/2014 11:35 PM, Sanjeev Sharma wrote:
> > >> Hi,
> > >>
> > >> I am new to OpenNLP. I've been playing with chunking, tokenizing,
> > >> POS tagging, and name recognition for a few days. I've been
> > >> following the example code and using preexisting models from
> > >> http://opennlp.sourceforge.net/models-1.5/. I've been having some
> > >> trouble with name recognition and organization recognition in
> > >> that, using the above-mentioned models, I can only identify common
> > >> names or organizations like "Mike Smith" and "IBM". In addition, I
> > >> need to be able to find date ranges and technical language like
> > >> "Java", "C++", and "HTML" (I should mention that my input is going
> > >> to be resumes).
> > >>
> > >> I figured I need to train my own models, especially since my
> > >> training data should look more like my input to give a better
> > >> context (i.e. resumes). I've been trying to find some information
> > >> on how to do this in the documentation and also doing Google
> > >> searches. I found a few simple examples, but not much more.
> > >> I did see the example in the documentation with the
> > >> "<START:person> <END>" tags and the command line to process the
> > >> training data into a .bin file, but nothing with organization
> > >> names. I tried to look at one or two of the annotation guides and
> > >> that created more questions than answers (for example, the
> > >> annotation guides are not consistent with each other or with the
> > >> example in the documentation. Are there pros and cons between the
> > >> different formats? Are the examples in the documentation in a
> > >> native format? Is there a conversion utility? If so, and I'm
> > >> creating data from scratch, would it not be better to just put it
> > >> in the native format?)
> > >>
> > >> I just lack understanding of OpenNLP and NLP in general, and the
> > >> OpenNLP Manual just hasn't worked for me. Maybe I'm just
> > >> misinterpreting the documentation or just not looking in the right
> > >> place. I would appreciate it greatly if someone could point me in
> > >> the right direction by way of real-world examples of training a
> > >> model, recommend a book I can read through, or share some good
> > >> examples of training data. Beyond the specific task I'm trying to
> > >> accomplish, I would like to get a deeper understanding of how
> > >> OpenNLP works.
> > >
> > > Hello,
> > >
> > > the OpenNLP Name Finder training format is rather simple. As you
> > > already figured out, you need to use the <START:entity_name> and
> > > <END> tags to mark the names in tokenized plain-text documents.
> > >
> > > In the example above you could replace <START:person> with
> > > <START:organization> to mark up an organization name in your text.
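> > >
> > > For example, a single tokenized line with both entity types marked
> > > up could look like this (just an illustrative sentence, not from a
> > > real corpus; note the whitespace around the tags and the tokenized
> > > punctuation):
> > >
> > > <START:person> Mike Smith <END> worked at <START:organization> IBM <END> from 2008 to 2012 .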
> > >
> > > To create a model which performs well on your documents you will
> > > have to label quite a few of them, and using a text editor to
> > > insert the tags is an approach which does not scale beyond a few
> > > documents.
> > >
> > > I suggest having a look at brat:
> > > http://brat.nlplab.org/
> > >
> > > Brat has a few issues in the 1.3 release version, but they are now
> > > resolved in the trunk, so I recommend using the trunk instead of 1.3.
> > >
> > > The OpenNLP Name Finder in the trunk version can be trained
> > > directly on the brat format. If you want to use OpenNLP 1.5.3
> > > instead, you can still use 1.6.0 to convert the data into the
> > > OpenNLP format discussed above.
> > >
> > > I know a few people who have done this successfully. Let us know if
> > > you have any issues, and a contribution about this process to our
> > > documentation would be very welcome!
> > >
> > > HTH,
> > > Jörn