I train the model on a sample stream with many sentences, one per line. The 
single sentence is just a trivial test example to see if abbreviations work.

model = trainer.train(language, sampleStream, fact, trainingParameters);

It seems like I have to define an abbreviation in the dictionary and also 
include examples of it in the training data for this to work. In that case, 
I'm not clear what the abbreviations dictionary actually does.

-----Original Message-----
From: Daniel Russ [mailto:dr...@apache.org] 
Sent: Wednesday, September 6, 2017 9:51 AM
To: users@opennlp.apache.org
Subject: Re: How do abbreviations work when training a sentence detector

You are trying to train a sentence detector with only 1 sentence. Each line 
should be 1 sentence; the final character in the line marks the EOS (end of 
sentence). It should handle abbreviations correctly. The idea behind the S.D. 
(sentence detector) is that every period (or ? or !) is classified as EOS or 
notEOS.
Daniel  

Please see: 
http://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.sentdetect 
for more info.
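
To make the EOS / notEOS idea concrete, here is a minimal sketch of the labels that a one-sentence-per-line training file implies. It is plain Java and does not call OpenNLP at all; the class name EosLabels, the method label, and the fixed candidate set ".?!" are assumptions made up for this illustration, not OpenNLP's actual implementation. The point is only that an abbreviation in the middle of a training line produces a notEOS example, while the last character of the line produces an EOS example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not part of OpenNLP.
class EosLabels {
    // Assumed candidate characters; the thread mentions '.', '?' and '!'.
    static final String EOS_CHARS = ".?!";

    /**
     * For one training line (assumed to hold exactly one sentence), list each
     * candidate character with the label the line implies for it: the final
     * character of the line is EOS, every earlier candidate is notEOS.
     */
    static List<String> label(String line) {
        String s = line.trim();
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            if (EOS_CHARS.indexOf(s.charAt(i)) < 0) {
                continue; // not a candidate sentence boundary
            }
            // Token that ends at the candidate, e.g. "Ms." or "mat."
            int start = s.lastIndexOf(' ', i) + 1;
            String token = s.substring(start, i + 1);
            boolean isLast = (i == s.length() - 1);
            labels.add(token + " -> " + (isLast ? "EOS" : "notEOS"));
        }
        return labels;
    }

    public static void main(String[] args) {
        // Each string stands for one line of the training file.
        String[] trainingLines = {
            "The cat, Ms. Furry, sat on the mat.",
            "I called 464-6859 ext. 13 and asked for Mr. Frank."
        };
        for (String line : trainingLines) {
            System.out.println(label(line));
        }
        // prints:
        // [Ms. -> notEOS, mat. -> EOS]
        // [ext. -> notEOS, Mr. -> notEOS, Frank. -> EOS]
    }
}
```

If the training file contains no lines where "Ms." or "ext." appears mid-sentence, the trainer never sees a notEOS example for those tokens, which would be consistent with the splitting reported in the quoted message below.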


> On Sep 6, 2017, at 12:21 PM, Ade Miller <ade.mil...@getconga.com> wrote:
> 
> I'm trying to train a sentence detector with a set of abbreviations but am 
> not seeing the behavior I expected.
> 
>        InputStreamFactory factory = new MarkableFileInputStreamFactory(trainingData);
>        PlainTextByLineStream lineStream = new PlainTextByLineStream(factory, Constants.CHARSET);
>        ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
> 
>        Dictionary abbreviations = new AbbreviationsResourceLoader().load();
>        SentenceDetectorFactory fact = new SentenceDetectorFactory(language, true, abbreviations, null);
>        model = trainer.train(language, sampleStream, fact, trainingParameters);
> 
>        CustomSentenceDetectorME detect = new CustomSentenceDetectorME(model);
>        String[] sentences = detect.sentDetect(
>            "The cat, Ms. Furry, sat on the mat. "
>            + "I called 464-6859 ext. 13 and asked for Mr. Frank. "
>            + "The dog, well, it lay in Mrs. Smythe's yard.");
>        for (String s : sentences) {
>            LOG.info(s);
>        }
> 
> The output I get shows that sentences are being split on the abbreviations:
> 
> The cat, Ms.
> , sat on the mat.
> I called 464-6859 ext.
> 13 and asked for Mr.
> Frank.
> The dog, well, it lay in Mrs.
> Smythe's yard.
> 
> How is the abbreviation dictionary used? Does the training set also have to 
> include examples of the same abbreviation(s)?
> 
> Thanks,
> 
> Ade
