Hi - Can you ensure that your training data is in the format mentioned in the wiki? [0]
As mentioned in the wiki, training data should look something like this:

<START:person> Pierre Vinken <END> 61 years old , will join the board as a nonexecutive director Nov. 29

Here the entity type is "person", and "Pierre Vinken" is one of the persons in the training data. I was looking at the links you shared, and your data is in a different format. Can you ensure your eng.train is in the above format? I think you can write your own code to read the training file and convert it into the OpenNLP format.

Also look at [1] in case you can make use of a pre-trained model available for OpenNLP.

HTH

[0] https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html#tools.namefind.training
[1] http://opennlp.sourceforge.net/models-1.5/

--
Madhav Sharan

On Sun, Feb 26, 2017 at 9:42 PM, Madhvi Gupta <mgmahi....@gmail.com> wrote:

> Please let me know if anyone has any idea about this.
>
> With Regards
> Madhvi Gupta
> *(Senior Software Engineer)*
>
> On Tue, Feb 21, 2017 at 10:51 AM, Madhvi Gupta <mgmahi....@gmail.com>
> wrote:
>
> > Hi Joern,
> >
> > The training data generated from the Reuters dataset is in the following
> > format. It produced three files: eng.train, eng.testa, eng.testb.
> >
> > A DT I-NP O
> > rare JJ I-NP O
> > early JJ I-NP O
> > handwritten JJ I-NP O
> > draft NN I-NP O
> > of IN I-PP O
> > a DT I-NP O
> > song NN I-NP O
> > by IN I-PP O
> > U.S. NNP I-NP I-LOC
> > guitar NN I-NP O
> > legend NN I-NP O
> > Jimi NNP I-NP I-PER
> >
> > Using this training data file, I ran the command:
> >
> > ./opennlp TokenNameFinderTrainer -model en-ner-person.bin -lang en -data
> > /home/centos/ner/eng.train -encoding UTF-8
> >
> > It is giving me the following error:
> >
> > ERROR: Not enough training data
> > The provided training data is not sufficient to create enough events to
> > train a model.
> > To resolve this error use more training data, if this doesn't help there
> > might be some fundamental problem with the training data itself.
> >
> > The format required for training OpenNLP models is in the form of
> > sentences, but the training data prepared from the Reuters dataset is in
> > the above format. So please tell me how training data can be generated in
> > the required format, or how the existing training data format can be used
> > for generating models.
> >
> > With Regards
> > Madhvi Gupta
> > *(Senior Software Engineer)*
> >
> > On Mon, Feb 20, 2017 at 5:52 PM, Joern Kottmann <kottm...@gmail.com>
> > wrote:
> >
> >> Please explain to us what is not working. Any error messages or
> >> exceptions?
> >>
> >> The name finder by default trains on the default format, which you can
> >> see in the documentation link I shared.
> >>
> >> Jörn
> >>
> >> On Mon, Feb 20, 2017 at 6:04 AM, Madhvi Gupta <mgmahi....@gmail.com>
> >> wrote:
> >>
> >> > Hi Joern,
> >> >
> >> > I have got the data from the following link, which consists of a
> >> > corpus of news articles.
> >> > http://trec.nist.gov/data/reuters/reuters.html
> >> >
> >> > Following the steps given in the link below, I have created training
> >> > and test data, but it is not working with the NameFinder of the
> >> > OpenNLP API.
> >> > http://www.clips.uantwerpen.be/conll2003/ner/000README
> >> >
> >> > So can you please help me how to create training data out of that
> >> > corpus and use it to create named entity detection models?
> >> >
> >> > With Regards
> >> > Madhvi Gupta
> >> > *(Senior Software Engineer)*
> >> >
> >> > On Mon, Feb 20, 2017 at 1:00 AM, Joern Kottmann <kottm...@gmail.com>
> >> > wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > To train the name finder you need training data that contains the
> >> > > entities you would like to detect.
> >> > > Is that the case with the data you have?
> >> > >
> >> > > Take a look at our documentation:
> >> > > https://opennlp.apache.org/documentation/1.7.2/manual/opennlp.html#tools.namefind.training
> >> > >
> >> > > At the beginning of that section you can see how the data has to be
> >> > > marked up.
> >> > >
> >> > > Please note that you need many sentences to train the name finder.
> >> > >
> >> > > HTH,
> >> > > Jörn
> >> > >
> >> > >
> >> > > On Sat, Feb 18, 2017 at 11:28 AM, Madhvi Gupta <mgmahi....@gmail.com>
> >> > > wrote:
> >> > >
> >> > > > Hi All,
> >> > > >
> >> > > > I have got Reuters data from NIST. Now I want to generate training
> >> > > > data from that to create a model for detecting named entities. Can
> >> > > > anyone tell me how the models can be generated from that.
> >> > > >
> >> > > > --
> >> > > > With Regards
> >> > > > Madhvi Gupta
> >> > > > *(Senior Software Engineer)*
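
[Editor's note: the conversion suggested above — reading CoNLL-2003 style columns (token, POS, chunk, NER tag, one token per line, blank line between sentences) and emitting OpenNLP's inline <START:type> ... <END> markup — could be sketched roughly as follows. The tag-to-type mapping, function names, and the file-handling helper are illustrative assumptions, not something specified in the thread.]

```python
# Hedged sketch: convert CoNLL-2003 style NER columns into the OpenNLP
# name finder training format (<START:type> tokens <END>, one sentence
# per line). TYPE_MAP is an assumed mapping from CoNLL labels to
# OpenNLP entity-type names.

TYPE_MAP = {"PER": "person", "LOC": "location",
            "ORG": "organization", "MISC": "misc"}

def convert_sentence(rows):
    """rows: list of (token, ner_tag) pairs for one sentence."""
    out = []
    open_type = None  # entity type of the currently open <START:...> span
    for token, tag in rows:
        if tag == "O":
            if open_type:
                out.append("<END>")
                open_type = None
            out.append(token)
            continue
        prefix, _, label = tag.partition("-")   # e.g. "I-PER" -> ("I", "PER")
        etype = TYPE_MAP.get(label, label.lower())
        # Close the open span on a B- tag or on an entity-type change.
        if open_type and (prefix == "B" or etype != open_type):
            out.append("<END>")
            open_type = None
        if open_type is None:
            out.append("<START:%s>" % etype)
            open_type = etype
        out.append(token)
    if open_type:
        out.append("<END>")
    return " ".join(out)

def convert_file(src_path, dst_path):
    """Read a CoNLL-style file and write one OpenNLP-format sentence per line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        rows = []
        for line in src:
            parts = line.split()
            if not parts or parts[0] == "-DOCSTART-":
                if rows:
                    dst.write(convert_sentence(rows) + "\n")
                    rows = []
                continue
            # Keep only the token (first column) and NER tag (last column).
            rows.append((parts[0], parts[-1]))
        if rows:
            dst.write(convert_sentence(rows) + "\n")
```

The resulting file could then be passed to the TokenNameFinderTrainer command shown earlier in the thread. Note also that recent OpenNLP releases document built-in support for other input formats on their trainer tools, so checking the manual [0] before writing a converter may save the effort entirely.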