Can we have one training set covering all types of entities, e.g. places and names?

----- Original Message ----- From: "Mark G" <[email protected]>
To: <[email protected]>
Sent: Saturday, October 26, 2013 2:56 PM
Subject: SPAM-HIGH: Re: Training Data Query (Newbie)


Are you trying to extract named entities (NER), or to perform categorization?
If you are doing NER, first decide which entities you want to extract,
then annotate a subset of tweets using the OpenNLP annotation guidelines for
NER. Have a look here:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.api
The format for NER training looks like this. Notice that the start and end
tags have a space after the first > and before the <END>:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

Once you have created a file of sentences like the ones above, use that file
to build a model, then use the model you created with a TokenNameFinder.
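
To make the annotation format above concrete, here is a minimal sketch in
plain Java (no OpenNLP classes) that turns an already tokenized sentence plus
entity spans into one training line. The class NerTrainingLine, its Span
type, and the "location" entity type are hypothetical names used only for
illustration; the tag layout itself follows the sample above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: builds one annotated training line.
final class NerTrainingLine {

    // A span over token indexes [start, end) with an entity type such as "person".
    static final class Span {
        final int start;
        final int end;
        final String type;

        Span(int start, int end, String type) {
            if (start < 0 || end <= start) {
                throw new IllegalArgumentException("invalid span: " + start + ".." + end);
            }
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    // Joins the tokens with single spaces and wraps each span in
    // "<START:type> ... <END>", keeping a space inside both tags.
    static String format(String[] tokens, List<Span> spans) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            for (Span s : spans) {
                if (s.start == i) {
                    sb.append("<START:").append(s.type).append("> ");
                }
            }
            sb.append(tokens[i]);
            for (Span s : spans) {
                if (s.end == i + 1) {
                    sb.append(" <END>");
                }
            }
            if (i < tokens.length - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Pierre", "Vinken", ",", "61", "years", "old"};
        // prints: <START:person> Pierre Vinken <END> , 61 years old
        System.out.println(format(tokens, Arrays.asList(new Span(0, 2, "person"))));
    }
}
```

Writing one such line per sentence to a text file gives the kind of file
described above. Because the entity type is part of the start tag, the same
helper can emit other types by passing a different type string.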

For categorization, use the doccat tools or API. Have a look here:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.doccat.classifying.api
The format is as follows, where "GMDecrease" and "GMIncrease" are the
categories. If you are doing sentiment analysis, you would have "good"
or "bad" or something similar, followed by an exemplary chunk of text:


GMDecrease Major acquisitions that have a lower gross margin than the existing network also \
           had a negative impact on the overall gross margin, but it should improve following \
           the implementation of its integration strategies .
GMIncrease The upward movement of gross margin resulted from amounts pursuant to adjustments \
           to obligations towards dealers .
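
As a rough sketch of how such a file could be produced for tweets, the plain
Java helper below (no OpenNLP classes) writes the category first, then a
space, then the text collapsed onto a single line. The class name
DoccatTrainingLine and the sample tweet text are made up for illustration,
and the sketch assumes each training document should end up on one line of
its own.

```java
// Hypothetical helper, not part of OpenNLP: builds one categorizer training line.
final class DoccatTrainingLine {

    // Returns "<category> <text>" with all whitespace runs in the text
    // (including line breaks) collapsed to single spaces.
    static String format(String category, String text) {
        if (!category.matches("\\S+")) {
            throw new IllegalArgumentException(
                "category must be a single token without whitespace: '" + category + "'");
        }
        String oneLine = text.trim().replaceAll("\\s+", " ");
        return category + " " + oneLine;
    }

    public static void main(String[] args) {
        // prints: good Loving the new phone !
        System.out.println(format("good", "Loving the new\n  phone !"));
        // prints: bad Battery died after two hours .
        System.out.println(format("bad", "Battery died after two hours ."));
    }
}
```

Rejecting categories that contain whitespace keeps the first token of every
line unambiguous as the label.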

HTH

MG



On Fri, Oct 25, 2013 at 7:24 AM, <[email protected]> wrote:

I am new to OpenNLP. I have the basic Java code running.

I want to create a training set for Twitter topics. I have the Training
API page with the sample code, but I cannot work out from it how to
create and modify the training set.

Can anyone help?

POSModel model = null;

InputStream dataIn = null;
try {
 dataIn = new FileInputStream("en-pos.train");
 ObjectStream<String> lineStream =
                new PlainTextByLineStream(dataIn, "UTF-8");
 ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);

 model = POSTaggerME.train("en", sampleStream, ModelType.MAXENT,
     null, null, 100, 5);
}
catch (IOException e) {
 // Failed to read or parse training data, training failed
 e.printStackTrace();
}
finally {
 if (dataIn != null) {
   try {
     dataIn.close();
   }
   catch (IOException e) {
     // Not an issue, training already finished.
     // The exception should be logged and investigated
     // if part of a production system.
     e.printStackTrace();
   }
 }
}



----- Original Message ----- From: "Massimo Tarantelli" <m.tarantelli@innovationengineering.eu>
To: <[email protected]>
Sent: Friday, October 25, 2013 11:47 AM
Subject: Document categorizer model


Dear all,
has anyone trained a Document Categorizer model for English?
Thanks
--

*Massimo Tarantelli*

_Innovation Engineering_
Via Napoleone Colajanni 4 (00191 Roma)
T +39 06 45 425 111
E m.tarantelli@innovationengineering.eu








