I'm trying to train a sentence detector with a set of abbreviations but am not 
seeing the behavior I expected.

        InputStreamFactory factory = new MarkableFileInputStreamFactory(trainingData);
        PlainTextByLineStream lineStream =
                new PlainTextByLineStream(factory, Constants.CHARSET);
        ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

        Dictionary abbreviations = new AbbreviationsResourceLoader().load();
        SentenceDetectorFactory fact =
                new SentenceDetectorFactory(language, true, abbreviations, null);
        model = trainer.train(language, sampleStream, fact, trainingParameters);

        CustomSentenceDetectorME detect = new CustomSentenceDetectorME(model);
        String[] sentences = detect.sentDetect(
                "The cat, Ms. Furry, sat on the mat. "
                + "I called 464-6859 ext. 13 and asked for Mr. Frank. "
                + "The dog, well, it lay in Mrs. Smythe's yard.");
        for (String s : sentences) {
            LOG.info(s);
        }

The output I get shows that sentences are being split on the abbreviations:

The cat, Ms.
, sat on the mat.
I called 464-6859 ext.
13 and asked for Mr.
Frank.
The dog, well, it lay in Mrs.
Smythe's yard.
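
To make the symptom easy to check after each retraining, here is a minimal sketch that flags any detected sentence whose last token is a known abbreviation. It does not use OpenNLP at all; the class name AbbreviationSplitCheck, the method endsWithAbbreviation, and the hard-coded abbreviation set are hypothetical stand-ins for whatever my dictionary actually contains.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of OpenNLP: reports sentences that end
// on an abbreviation, which suggests a wrong split at that point.
public class AbbreviationSplitCheck {

    static List<String> endsWithAbbreviation(String[] sentences, Set<String> abbreviations) {
        List<String> suspicious = new ArrayList<>();
        for (String sentence : sentences) {
            String trimmed = sentence.trim();
            // Last whitespace-delimited token; the whole string if there is no space.
            String lastToken = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
            if (abbreviations.contains(lastToken)) {
                suspicious.add(trimmed);
            }
        }
        return suspicious;
    }

    public static void main(String[] args) {
        // The detector output shown above, copied verbatim.
        String[] detected = {
            "The cat, Ms.",
            ", sat on the mat.",
            "I called 464-6859 ext.",
            "13 and asked for Mr.",
            "Frank.",
            "The dog, well, it lay in Mrs.",
            "Smythe's yard."
        };
        // Assumed abbreviation list, for illustration only.
        Set<String> abbreviations =
                new HashSet<>(Arrays.asList("Ms.", "ext.", "Mr.", "Mrs."));

        for (String s : endsWithAbbreviation(detected, abbreviations)) {
            System.out.println("Split on abbreviation: " + s);
        }
    }
}
```

The match is exact and case-sensitive, so it only catches abbreviations spelled the same way as in the set.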

How is the abbreviation dictionary used? Does the training set also have to 
include examples of the same abbreviations?

Thanks,

Ade
