Are you trying to extract named entities (NER), or to categorize the tweets?
If you are doing NER, first decide which entities you want to extract, then
annotate a subset of tweets following the OpenNLP training format for NER.
Have a look here:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.namefind.training.api
The NER training format looks like this. Notice that there is a space after
the > of the start tag and a space before <END>:
<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .
Once you have created a file of sentences like the ones above (one sentence
per line), use that file to build a model, then use the model you created
with a TokenNameFinder.
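
If your annotations start out as token lists plus entity positions, a small
helper can write the tagged sentences for you. This is only a sketch using
plain Java, not part of OpenNLP: the class NerTrainingLine and its EntitySpan
type are names I made up for illustration, and it assumes the spans do not
overlap and that "end" is exclusive.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not an OpenNLP class): builds one NER training line. */
class NerTrainingLine {

    /** Hypothetical span type: tokens [start, end) belong to one entity. */
    static final class EntitySpan {
        final int start;   // index of the first token of the entity
        final int end;     // index one past the last token of the entity
        final String type; // entity type, e.g. "person"

        EntitySpan(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** Joins tokens with single spaces and wraps each span in start/end tags. */
    static String format(String[] tokens, List<EntitySpan> spans) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            for (EntitySpan s : spans) {
                if (s.start == i) {
                    // Space after the '>' of the start tag, as the format requires.
                    sb.append("<START:").append(s.type).append("> ");
                }
            }
            sb.append(tokens[i]);
            for (EntitySpan s : spans) {
                if (s.end == i + 1) {
                    // Space before "<END>", as the format requires.
                    sb.append(" <END>");
                }
            }
            if (i < tokens.length - 1) {
                sb.append(' ');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tokens = {"Mr", ".", "Vinken", "is", "chairman", "of",
                           "Elsevier", "N.V.", "."};
        List<EntitySpan> spans = new ArrayList<EntitySpan>();
        spans.add(new EntitySpan(2, 3, "person"));
        // Each call produces one line of the training file.
        System.out.println(format(tokens, spans));
    }
}
```

Write one such line per tweet (or per sentence) to a UTF-8 text file and use
that file as the training data.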
For categorization, use the doccat tools or API. Have a look here:
http://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html#tools.doccat.classifying.api
The training format is shown below, where "GMDecrease" and "GMIncrease" are
the categories. If you are doing sentiment analysis, you would have
categories such as "good" or "bad" instead, each followed by a
representative chunk of text:
GMDecrease Major acquisitions that have a lower gross margin than the existing network also \
           had a negative impact on the overall gross margin, but it should improve following \
           the implementation of its integration strategies .
GMIncrease The upward movement of gross margin resulted from amounts pursuant to adjustments \
           to obligations towards dealers .
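
For tweets, the doccat training file is one document per line: the category,
a space, then the text. Here is a rough sketch of a helper that produces such
lines. It is plain Java, not part of OpenNLP; the class name
DoccatTrainingLine is made up for illustration, and it assumes you want line
breaks inside a tweet collapsed to single spaces so each tweet stays on one
line.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not an OpenNLP class): builds one doccat training line. */
class DoccatTrainingLine {

    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    /** Returns "category text", with the text squeezed onto a single line. */
    static String format(String category, String text) {
        // The category is the first whitespace-delimited field on the line,
        // so it must be a single non-empty token.
        if (category.isEmpty() || WHITESPACE.matcher(category).find()) {
            throw new IllegalArgumentException(
                "category must be one token without whitespace: " + category);
        }
        // Collapse newlines, tabs and repeated spaces so the document
        // cannot spill onto a second line of the training file.
        String oneLine = text.trim().replaceAll("\\s+", " ");
        return category + " " + oneLine;
    }

    public static void main(String[] args) {
        // Made-up example tweets and labels, for illustration only.
        System.out.println(format("good", "Loving the new\n phone!"));
        System.out.println(format("bad", "Battery died after   one hour"));
    }
}
```

Write the resulting lines to a UTF-8 text file and use that file as the
doccat training data.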
HTH
MG
On Fri, Oct 25, 2013 at 7:24 AM, <[email protected]> wrote:
> I am new to openNLP. I have the basic java code running.
>
> I want to create a training set for twitter topics. I have the Training
> API page with the sample code, but I cannot comprehend from that how to
> create and modify the training set.
>
> Can anyone help?
>
> POSModel model = null;
>
> InputStream dataIn = null;
> try {
>   dataIn = new FileInputStream("en-pos.train");
>   ObjectStream<String> lineStream =
>       new PlainTextByLineStream(dataIn, "UTF-8");
>   ObjectStream<POSSample> sampleStream =
>       new WordTagSampleStream(lineStream);
>
>   model = POSTaggerME.train("en", sampleStream, ModelType.MAXENT,
>       null, null, 100, 5);
> }
> catch (IOException e) {
>   // Failed to read or parse training data, training failed
>   e.printStackTrace();
> }
> finally {
>   if (dataIn != null) {
>     try {
>       dataIn.close();
>     }
>     catch (IOException e) {
>       // Not an issue, training already finished.
>       // The exception should be logged and investigated
>       // if part of a production system.
>       e.printStackTrace();
>     }
>   }
> }
>
>
>
> ----- Original Message -----
> From: "Massimo Tarantelli" <m.tarantelli@innovationengineering.eu>
> To: <[email protected]>
> Sent: Friday, October 25, 2013 11:47 AM
> Subject: Document categorizer model
>
>
> Dear all,
>> has anyone trained a Document Categorizer model in English?
>> thanks
>> --
>>
>> *Massimo Tarantelli*
>>
>> _Innovation Engineering_
>> Via Napoleone Colajanni 4 (00191 Roma)
>> T +39 06 45 425 111
>> E m.tarantelli@innovationengineering.eu
>>
>>
>
>
>