I train the model on a sample stream with many sentences, one per line. The single sentence is just a trivial test example to see if abbreviations work.

model = trainer.train(language, sampleStream, fact, trainingParameters);

It seems like I have to define an abbreviation in the dictionary and examples in the training data for this to work. In which case I'm not clear what the abbreviations dictionary actually does.

-----Original Message-----
From: Daniel Russ [mailto:dr...@apache.org]
Sent: Wednesday, September 6, 2017 9:51 AM
To: users@opennlp.apache.org
Subject: Re: How do abbreviations work when training a sentence detector

You are trying to train a sentence detector with only 1 sentence. Each line should be 1 sentence; the final character in the line marks the EOS. It should handle abbreviations correctly. The idea behind the S.D. is that every period (or ? or !) is classified as EOS or notEOS.
Daniel

Please see: http://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.sentdetect for more info.

> On Sep 6, 2017, at 12:21 PM, Ade Miller <ade.mil...@getconga.com> wrote:
>
> I'm trying to train a sentence detector with a set of abbreviations but am not seeing the behavior I expected.
>
> InputStreamFactory factory = new MarkableFileInputStreamFactory(trainingData);
> PlainTextByLineStream lineStream = new PlainTextByLineStream(factory, Constants.CHARSET);
> ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
>
> Dictionary abbreviations = new AbbreviationsResourceLoader().load();
> SentenceDetectorFactory fact = new SentenceDetectorFactory(language, true, abbreviations, null);
> model = trainer.train(language, sampleStream, fact, trainingParameters);
>
> CustomSentenceDetectorME detect = new CustomSentenceDetectorME(model);
> String[] sentences = detect.sentDetect("The cat, Ms. Furry, sat on the mat. I called 464-6859 ext. 13 and asked for Mr. Frank. The dog, well, it lay in Mrs. Smythe's yard.");
> for (String s : sentences) {
>     LOG.info(s);
> }
>
> The output I get shows that sentences are being split on the abbreviations:
>
> The cat, Ms.
> , sat on the mat.
> I called 464-6859 ext.
> 13 and asked for Mr.
> Frank.
> The dog, well, it lay in Mrs.
> Smythe's yard.
>
> How is the abbreviation dictionary used? Does the training set also have to include examples of the same abbreviation(s)?
>
> Thanks,
>
> Ade
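
To make the "every period (or ? or !) is classified as EOS or notEOS" idea from Daniel's reply concrete, here is a toy sketch in plain Java. It is not OpenNLP code and does not use the OpenNLP API: the class name ToySentenceSplitter and its hard rule ("a token found in the abbreviation set is never EOS") are assumptions made purely for illustration. A trained statistical detector would presumably weigh an abbreviation match as one signal learned from the training data rather than apply it as a fixed rule.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only: a rule-based stand-in for the EOS / notEOS decision.
class ToySentenceSplitter {

    // Visits every '.', '?' or '!' and decides whether it ends a sentence.
    static List<String> split(String text, Set<String> abbreviations) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c != '.' && c != '?' && c != '!') {
                continue; // not a candidate end-of-sentence character
            }
            // Walk back to the start of the token that ends at this candidate.
            int tokStart = i;
            while (tokStart > start && !Character.isWhitespace(text.charAt(tokStart - 1))) {
                tokStart--;
            }
            String token = text.substring(tokStart, i + 1);
            boolean atEnd = (i == text.length() - 1);
            boolean followedBySpace = !atEnd && Character.isWhitespace(text.charAt(i + 1));
            // Toy rule: EOS if followed by whitespace (or end of text) and not a known abbreviation.
            boolean isEos = (atEnd || followedBySpace) && !abbreviations.contains(token);
            if (isEos) {
                sentences.add(text.substring(start, i + 1).trim());
                start = i + 1;
            }
        }
        // Keep any trailing text that has no closing punctuation.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            sentences.add(rest);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The cat, Ms. Furry, sat on the mat. I called 464-6859 ext. 13 and asked "
                + "for Mr. Frank. The dog, well, it lay in Mrs. Smythe's yard.";
        Set<String> abbreviations = Set.of("Ms.", "Mr.", "Mrs.", "ext.");
        for (String s : split(text, abbreviations)) {
            System.out.println(s);
        }
    }
}
```

With the four abbreviations supplied, the sample text from the original message comes back as three sentences; with an empty set, the same text is split after every abbreviation, which mirrors the symptom described in the question.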