I'm trying to train a sentence detector with a set of abbreviations but am not seeing the behavior I expected.

    // Read the training data line by line (one sentence per line)
    InputStreamFactory factory = new MarkableFileInputStreamFactory(trainingData);
    PlainTextByLineStream lineStream = new PlainTextByLineStream(factory, Constants.CHARSET);
    ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

    // Load the abbreviation dictionary and pass it to the factory
    // (useTokenEnd = true, eosCharacters = null)
    Dictionary abbreviations = new AbbreviationsResourceLoader().load();
    SentenceDetectorFactory fact = new SentenceDetectorFactory(language, true, abbreviations, null);

    // Train the model and run detection on a sample text
    model = trainer.train(language, sampleStream, fact, trainingParameters);
    CustomSentenceDetectorME detect = new CustomSentenceDetectorME(model);
    String[] sentences = detect.sentDetect("The cat, Ms. Furry, sat on the mat. I called 464-6859 ext. 13 and asked for Mr. Frank. The dog, well, it lay in Mrs. Smythe's yard.");
    for (String s : sentences) {
        LOG.info(s);
    }

The output I get shows that sentences are being split on the abbreviations:

    The cat, Ms. , sat on the mat. I called 464-6859 ext. 13 and asked for Mr. Frank. The dog, well, it lay in Mrs. Smythe's yard.

How is the abbreviation dictionary used? Does the training set also have to include examples of the same abbreviation(s)?

Thanks,
Ade