Hello!

I am restating my problem, as I think it is quite similar to the following
issue:
https://issues.apache.org/jira/browse/OPENNLP-602

I work with clinical narratives, where eos (end-of-sentence) characters are
very often simply missing, and I am trying to train a new, robust sentence
model.
The issue above suggests encoding these types of endings with <CR><LF> or
just <LF>.

How do I set this up properly?

char[] eosCharacters = {'!', '?', '.'};
SentenceDetectorFactory sentenceFactory =
    new SentenceDetectorFactory("de", true, null, eosCharacters);

eosCharacters is a char array, so how do I add the suggested encodings
'<CR><LF>' and '<LF>' to it?
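
A minimal sketch of one possible reading of the suggestion, not a confirmed
answer: in Java, <CR> and <LF> are the single characters '\r' and '\n', so
each fits into a char array as its own element, while the two-character
sequence <CR><LF> cannot be one element. The class EosChars and the method
withLineBreaks are hypothetical names used only for illustration; whether
the sentence detector then treats these characters as intended is an
assumption that depends on the OpenNLP version.

```java
import java.util.Arrays;

final class EosChars {

    // Hypothetical helper: returns a copy of the given eos characters
    // with <LF> ('\n') and <CR> ('\r') appended as two separate entries.
    static char[] withLineBreaks(char[] base) {
        char[] result = Arrays.copyOf(base, base.length + 2);
        result[base.length] = '\n';     // <LF>
        result[base.length + 1] = '\r'; // <CR>
        return result;
    }

    public static void main(String[] args) {
        char[] eosCharacters = withLineBreaks(new char[] {'!', '?', '.'});
        // Assumption: this array would be passed as the fourth argument
        // of the SentenceDetectorFactory constructor shown above.
        System.out.println("Number of eos characters: " + eosCharacters.length);
    }
}
```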

How should I then prepare my final training data set?
For example, my text contains something like the following (with an
artificial line break in the middle of the sentence):
The quick abbr. brown
fox jumps over the lazy dog

Training:
The quick abbr. brown fox jumps over the lazy dog <CR><LF>

If one of the standard eos characters {'.','?','!'} is present:
The quick abbr. brown
fox jumps over the lazy dog.

Training:
The quick abbr. brown fox jumps over the lazy dog.

If I have an abbreviation at the end of a sentence, do I have to encode
this in a special way?
The quick abbr. brown
fox jumps over the lazy dog abbr.

Training:
The quick abbr. brown fox jumps over the lazy dog abbr.
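
The preparation described in the three cases above could be sketched as
follows, assuming the usual one-sentence-per-line layout for sentence
detector training data. The class TrainingData and its methods are
hypothetical helpers, and the sketch assumes the sentences have already
been split by hand; it only removes the artificial line breaks inside a
sentence and ends every sentence with a single <LF>.

```java
import java.util.List;
import java.util.stream.Collectors;

final class TrainingData {

    // Replaces every line break inside one sentence (<LF>, <CR> or
    // <CR><LF>), together with surrounding whitespace, by a single space.
    static String joinWrappedLines(String sentence) {
        return sentence.trim().replaceAll("\\s*\\R\\s*", " ");
    }

    // Writes one sentence per line; every sentence ends with <LF>,
    // whether or not it also ends with '.', '?' or '!'.
    static String toTrainingText(List<String> sentences) {
        return sentences.stream()
                .map(TrainingData::joinWrappedLines)
                .collect(Collectors.joining("\n", "", "\n"));
    }
}
```

Under this sketch a sentence that ends in an abbreviation gets no special
encoding; it is only terminated by the line break like any other sentence.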

Once I have trained my model, do I have to adapt the input text so that it
uses e.g. <CR><LF> or <LF> in the same way as the training sentences?
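
Whether such an adaptation is required is exactly the open question. If it
is, and assuming the model was trained with <LF> as the line ending, the
input could be brought to the same convention with a small step like the
one below. The class LineEndings and the method toLf are hypothetical
names.

```java
final class LineEndings {

    // Converts <CR><LF> and lone <CR> to <LF>, so the input text uses
    // the same line-ending convention as the training sentences.
    static String toLf(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }
}
```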

Thank you for your help!

Best regards, Markus
