Re: Training Data Query (Newbie)

Mark G Sat, 26 Oct 2013 06:57:45 -0700

Are you trying to extract named entities (NER) or to categorize the tweets?
If you are doing NER, first decide which entities you want to extract, then
annotate a subset of tweets following the OpenNLP annotation guidelines for
NER. Have a look here:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.api
The format for NER training looks like this. Notice that the start and end
tags are separated from the text by a space: one after the > of the START
tag and one before <END>.


<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

Once you have created a file of sentences like the ones above, use that file
to train a model, then load the model you created into a TokenNameFinder
(such as NameFinderME).
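
Before training, it can help to check that every line of the annotated file
follows the tag layout described above. The following is only a sketch: the
NerLineCheck class and its entities method are hypothetical helpers written
for this reply, not part of OpenNLP, and they use plain Java only. The helper
splits a line on whitespace, collects the text between each <START:type> and
<END>, and rejects tags that are glued to a word.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of OpenNLP): checks the tag layout of one NER training line. */
final class NerLineCheck {

    /** Returns the annotated spans as "type=text" strings; throws if the markup is malformed. */
    static List<String> entities(String line) {
        List<String> found = new ArrayList<>();
        String type = null;                       // type of the currently open span, or null
        StringBuilder text = new StringBuilder(); // tokens collected inside the open span
        for (String token : line.trim().split("\\s+")) {
            if (token.startsWith("<START:") && token.endsWith(">")) {
                if (type != null) {
                    throw new IllegalArgumentException("nested <START> in: " + line);
                }
                type = token.substring("<START:".length(), token.length() - 1);
                text.setLength(0);
            } else if (token.equals("<END>")) {
                if (type == null) {
                    throw new IllegalArgumentException("<END> without <START> in: " + line);
                }
                found.add(type + "=" + text.toString().trim());
                type = null;
            } else if (token.contains("<START:") || token.contains("<END>")) {
                // A tag glued to a word, e.g. "<START:person>Pierre" or "Vinken<END>"
                throw new IllegalArgumentException("tag not separated by spaces: " + token);
            } else if (type != null) {
                text.append(token).append(' ');
            }
        }
        if (type != null) {
            throw new IllegalArgumentException("unclosed <START> in: " + line);
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample sentence taken from the format example above.
        System.out.println(entities(
            "Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group ."));
    }
}
```

Running each line of your training file through a check like this before
building the model should catch most spacing mistakes early.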

For categorization, use the doccat tools or API. Have a look here:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.doccat.classifying.api
The training format is shown below: each sample starts with its category
(here "GMDecrease" or "GMIncrease"), followed by an exemplary chunk of text.
If you are doing sentiment analysis, the categories would be something like
"good" or "bad".

                        
GMDecrease Major acquisitions that have a lower gross margin than the existing network also \
           had a negative impact on the overall gross margin, but it should improve following \
           the implementation of its integration strategies .
GMIncrease The upward movement of gross margin resulted from amounts pursuant to adjustments \
           to obligations towards dealers .
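
If you are generating the doccat training file from your tweets in code, a
small helper along these lines may be useful. Again this is only a sketch:
DoccatLine, format and parse are hypothetical names made up for this reply
and are not OpenNLP classes. It assumes one sample per line, with the
category as the first whitespace-separated token and the text after it, so
line breaks inside a tweet are flattened to single spaces.

```java
/** Hypothetical helper (not part of OpenNLP): builds and splits doccat-style training lines. */
final class DoccatLine {

    /** Formats one sample as "category text" on a single line. */
    static String format(String category, String text) {
        if (category.isEmpty() || category.chars().anyMatch(Character::isWhitespace)) {
            throw new IllegalArgumentException("category must be one non-empty token: " + category);
        }
        // Collapse newlines and repeated spaces so the sample stays on one line.
        String flat = text.trim().replaceAll("\\s+", " ");
        if (flat.isEmpty()) {
            throw new IllegalArgumentException("sample text is empty");
        }
        return category + " " + flat;
    }

    /** Splits a training line back into { category, text }. */
    static String[] parse(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("line has no text after the category: " + line);
        }
        return parts;
    }

    public static void main(String[] args) {
        // Sample text taken from the format example above.
        System.out.println(format("GMIncrease",
            "The upward movement of gross margin resulted from amounts\n"
            + "pursuant to adjustments to obligations towards dealers ."));
    }
}
```

Writing one formatted line per labelled tweet gives you a file in the same
shape as the sample above.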

HTH

MG



On Fri, Oct 25, 2013 at 7:24 AM, <[email protected]> wrote:

> I am new to OpenNLP. I have the basic Java code running.
>
> I want to create a training set for Twitter topics. I have the Training
> API page with the sample code, but I cannot work out from it how to
> create and modify the training set.
>
> Can anyone help?
>
> POSModel model = null;
>
> InputStream dataIn = null;
> try {
>  dataIn = new FileInputStream("en-pos.train");
>  ObjectStream<String> lineStream =
>                 new PlainTextByLineStream(dataIn, "UTF-8");
>  ObjectStream<POSSample> sampleStream =
>                 new WordTagSampleStream(lineStream);
>
>  model = POSTaggerME.train("en", sampleStream, ModelType.MAXENT,
>      null, null, 100, 5);
> }
> catch (IOException e) {
>  // Failed to read or parse training data, training failed
>  e.printStackTrace();
> }
> finally {
>  if (dataIn != null) {
>    try {
>      dataIn.close();
>    }
>    catch (IOException e) {
>      // Not an issue, training already finished.
>      // The exception should be logged and investigated
>      // if part of a production system.
>      e.printStackTrace();
>    }
>  }
> }
>
>
>
> ----- Original Message ----- From: "Massimo Tarantelli"
> <m.tarantelli@innovationengineering.eu>
> To: <[email protected]>
> Sent: Friday, October 25, 2013 11:47 AM
> Subject: Document categorizer model
>
>
>> Dear all,
>> has anyone trained a Document Categorizer model in English?
>> thanks
>> --
>>
>> *Massimo Tarantelli*
>>
>> _Innovation Engineering_
>> Via Napoleone Colajanni 4 (00191 Roma)
>> T +39 06 45 425 111
>> E m.tarantelli@innovationengineering.eu
>>
>>
>
>
>
