Different algorithms use different formats. This patch to Solr includes sample test data for the sentence detector, tokenizer, part-of-speech tagger, chunker, and named entity recognition tools:
https://issues.apache.org/jira/browse/LUCENE-2899

Unpack this as a patch directory, and ignore all of the "file not found"
messages. solr/contrib/opennlp/src/test-files/training/bin/trainall.sh is
a shell script that turns all of these training files into miniature
models. I created these only so that the unit tests have something to
run against. There is no real training data here, but the files do show
the formats.

Lance

----- Original Message -----
| From: "yuelin.sha" <[email protected]>
| To: "users" <[email protected]>
| Sent: Wednesday, October 24, 2012 9:16:38 PM
| Subject: Re: Re: Can one add trained data into a existing model?
|
| Hello, James
| Your reply is very helpful. I will try to train a new model, but so
| far I haven't found much information about the format of the training
| data for OpenNLP. Can you give me some advice on the training data
| format?
|
| 2012-10-25
|
| yuelin.sha
|
| From: James Kosin
| Sent: 2012-10-25 11:51
| Subject: Re: Can one add trained data into a existing model?
| To: "users" <[email protected]>
| Cc:
|
| Hello Juelin Sha,
|
| 1) Adding data to an existing model is not a trivial process. It also
| isn't supported, and I wouldn't suggest attempting it.
|
| 2) The sentence detection model is really quite easy to build, and it
| can be trained on only a few hundred really good samples of data. I've
| got my own sentence detector model trained on only about 80 sentences,
| and the model performs well for what I use it for. Of course, I watch
| it and add new sentences as I find issues.
|
| 3) If you really need the detection in the final model, you may donate
| your sentences to the project (I hope this means you are the owner of
| the sentences and that they are your work alone... copyright issues).
| A single example sentence isn't going to be enough, though... all the
| models are trained with several thousand sentences and have a default
| cut-off of 5 when training, to eliminate rare exceptions. But for
| sentence detection it is mainly the punctuation that determines the
| end of the sentence.
|
| Hopefully this helps answer some of your questions.
|
| James
|
| On 10/24/2012 11:06 PM, yuelin.sha wrote:
| > Hello everyone,
| >
| > We have been using the official English model for sentence
| > detection, but recently we want to train some more data into the
| > official model. Can someone give me a tip for this? I don't know
| > whether such work is supported by OpenNLP.
| >
| > Thanks in advance.
| >
| > 2012-10-25
| >
| > yuelin.sha
| >
| > --------------------------
| > Information contained in this e-mail (and any attachments) is
| > confidential and is intended for exclusive disclosure to the
| > addressee(s). Any unauthorized disclosure, reproduction,
| > distribution or other dissemination or use of this communication
| > is prohibited. If you have received this communication in error,
| > please delete this communication and contact us by replying to
| > this e-mail immediately.
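
For anyone reading this thread later and wondering what sentence detector training data might look like: as far as I understand the OpenNLP documentation, the sentence detector trainer expects plain text with one sentence per line and an empty line between documents. The Python sketch below only writes a file in that layout. The function name, the file name en-sent.train, and the sample sentences are all made up for illustration, and the layout should be checked against the test files in the patch before relying on it.

```python
import os
import tempfile


def write_sentence_training_file(documents, path):
    """Write documents as one sentence per line, blank line between documents.

    `documents` is a list of documents; each document is a list of sentences.
    The layout is an assumption based on the OpenNLP manual, not taken from
    the files in the patch.
    """
    blocks = []
    for sentences in documents:
        # Collapse internal whitespace so every sentence stays on one line.
        cleaned = [" ".join(s.split()) for s in sentences if s.strip()]
        if cleaned:
            blocks.append("\n".join(cleaned))
    text = "\n\n".join(blocks) + "\n"
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text


# Hypothetical sample data, for illustration only.
sample_documents = [
    ["Mr. Smith arrived at 9 a.m. on Monday.", "He left before noon."],
    ["Is training on 80 sentences enough?", "It can be, for a narrow domain."],
]

# Hypothetical output location and file name.
output_path = os.path.join(tempfile.gettempdir(), "en-sent.train")
training_text = write_sentence_training_file(sample_documents, output_path)
print(training_text)
```

Sentences that contain internal periods ("Mr.", "a.m.") are the useful ones to include, since the end-of-sentence decision is mostly about punctuation.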
