Hi Markus,

Just adding the characters <CR> and <LF> to the eos array is not going to
solve your problem. You would need to add <CR> and <LF> to your training
set, otherwise the sentence detector will ALWAYS end the sentence at
<CR><LF>. Think about how the training data is laid out (including the
example you gave): every line break in it marks the end of a sentence.

I think this would require OpenNLP to change the format of the sentence
detector training data, so that we could see <CR> and <LF>, read the next
word, and decide whether it is an end of sentence. You would want data like:

Patient admitted at 8:00 AM <CR><LF><End:Sentence>
He complained of stomach cramps <CR><LF><End:Sentence>

in order to catch the end-of-line as a sentence delimiter.

Do you see a way around it? Comments?

Daniel

> On Sep 29, 2017, at 9:52 AM, Markus Kreuzthaler
> <markus.kreuztha...@gmail.com> wrote:
>
> Hello!
>
> I state my problem again as I think it is quite similar to the following
> issue:
> https://issues.apache.org/jira/browse/OPENNLP-602
>
> I work with clinical narratives, so eos characters are very often just
> missing, and I am trying to train a new, robust sentence model.
> The issue above suggests encoding these types of endings with
> <CR><LF> or just a <LF>.
>
> How do I set this up properly?
>
> char[] eosCharacters = {'!','?','.'};
> SentenceDetectorFactory sentenceFactory = new SentenceDetectorFactory("de",
> true, null, eosCharacters);
>
> eosCharacters is a char array, so how do I put in your suggested encodings
> '<CR><LF>', '<LF>'?
>
> How do I have to prepare my final training data set then?
> So I have for example in the text something like (with an artificial line
> break in the middle of the sentence):
> The quick abbr. brown
> fox jumps over the lazy dog
>
> Training:
> The quick abbr. brown fox jumps over the lazy dog <CR><LF>
>
> If the standard eos characters {'.','?','!'} are present:
> The quick abbr. brown
> fox jumps over the lazy dog.
>
> Training:
> The quick abbr. brown fox jumps over the lazy dog.
>
> If I have an abbreviation at the end of a sentence, do I have to encode this
> in a special way?
> The quick abbr. brown
> fox jumps over the lazy dog abbr.
>
> Training:
> The quick abbr. brown fox jumps over the lazy dog abbr.
>
> When I have trained my model, do I have to adapt the input text to
> e.g. <CR><LF> or <LF> inputs as used in the training sentences?
>
> Thank you for your help!
>
> Kind regards,
> Markus
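
On the eosCharacters question in the quoted message: <CR> and <LF> are single characters, so in Java they can be written into a char array as '\r' and '\n'. The sketch below is plain Java and not OpenNLP code; the class EosCandidates and the method candidatePositions are hypothetical names used only for illustration. It assumes that the eos array only marks candidate positions, and it lists where those candidates fall in a text that contains line breaks. Whether a candidate really ends a sentence would still be up to the trained model, which is why the training data matters.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: a plain-Java stand-in that only
// shows where end-of-sentence candidates would fall for a given eos array.
class EosCandidates {

    // Assumed eos set: the usual punctuation plus <CR> ('\r') and <LF> ('\n').
    static final char[] EOS_CHARACTERS = {'!', '?', '.', '\r', '\n'};

    // Returns the index of every character of text that occurs in eosChars.
    static List<Integer> candidatePositions(String text, char[] eosChars) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            for (char eos : eosChars) {
                if (c == eos) {
                    positions.add(i);
                    break; // one hit per character is enough
                }
            }
        }
        return positions;
    }

    public static void main(String[] args) {
        // The example from the quoted message, with real line breaks.
        String text = "The quick abbr. brown\r\nfox jumps over the lazy dog\r\n";
        System.out.println(candidatePositions(text, EOS_CHARACTERS));
    }
}
```

For the two-line example this prints [14, 21, 22, 50, 51]: the period after "abbr" and each <CR> and <LF>. The mid-sentence line break at 21 and 22 is a candidate just like the one at the end, so the model needs training examples of both cases to tell them apart.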