Hi Markus,
   Just adding the characters <CR> and <LF> to the eos array is not going to 
solve your problem.  You would also need <CR> and <LF> in your training set; 
otherwise the sentence detector will ALWAYS end the sentence at <CR><LF>.  
Think about how the training data is laid out (including the example you 
gave): each sentence sits on its own line, so every line break in it marks a 
sentence end.  I think this would require OpenNLP to change the format of the 
sentence detector training data, so that it could see <CR> and <LF>, read the 
next word, and decide whether that is an end of sentence.  You would want data 
like:

Patient admitted at 8:00 AM <CR><LF> <End:Sentence> He complained of stomach 
cramps <CR><LF> <End:Sentence>

That way the end-of-line could be learned as a sentence delimiter.
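
To make the idea concrete, here is a rough Java sketch of how data in such a 
format might be turned into labelled line-break candidates.  It is only an 
illustration under assumptions: the input uses real '\n' characters instead 
of the <CR>/<LF> tokens, <End:Sentence> is the hypothetical marker from the 
example above, and LineBreakLabels / label are made-up names that are not 
part of OpenNLP.  It does not reflect how the OpenNLP trainer reads its data 
today.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of OpenNLP: labels every line break in
// annotated text as "ends a sentence" (true) or "inside a sentence" (false).
final class LineBreakLabels {

    // Hypothetical marker taken from the example data above.
    static final String MARKER = "<End:Sentence>";

    // Returns a map from the offset of each '\n' (counted in the text with
    // all markers removed) to whether that line break ends a sentence.
    static Map<Integer, Boolean> label(String annotated) {
        Map<Integer, Boolean> labels = new LinkedHashMap<>();
        int plainLength = 0; // characters seen so far, markers excluded
        int pending = -1;    // offset of the last '\n' not yet followed by text
        int i = 0;
        while (i < annotated.length()) {
            if (annotated.startsWith(MARKER, i)) {
                // A marker only applies to a line break that directly
                // precedes it (whitespace in between is allowed).
                if (pending >= 0) {
                    labels.put(pending, Boolean.TRUE);
                    pending = -1;
                }
                i += MARKER.length();
                continue;
            }
            char c = annotated.charAt(i);
            if (c == '\n') {
                // Every line break is a candidate; assume "not an end"
                // until a marker says otherwise.
                labels.put(plainLength, Boolean.FALSE);
                pending = plainLength;
            } else if (!Character.isWhitespace(c)) {
                // Ordinary text after the line break: it was not an end.
                pending = -1;
            }
            plainLength++;
            i++;
        }
        return labels;
    }

    public static void main(String[] args) {
        String annotated = "Patient admitted at 8:00 AM\n<End:Sentence>"
                + "He complained of\nstomach cramps\n<End:Sentence>";
        // Prints each line-break offset with its true/false label.
        System.out.println(label(annotated));
    }
}
```

The point of the sketch is that the training data would then contain both 
kinds of line break, so a model could learn from the surrounding words when a 
line break is a sentence end and when it is not.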

Do you see a way around it?  Comments?
Daniel

> On Sep 29, 2017, at 9:52 AM, Markus Kreuzthaler 
> <markus.kreuztha...@gmail.com> wrote:
> 
> Hello!
> 
> I state my problem again as I think it is quite similar to the following
> issue:
> https://issues.apache.org/jira/browse/OPENNLP-602
> 
> I work with clinical narratives so eos characters are very often just
> missing, and I try to train a new robust sentence model.
> From the issue above it is suggested to encode these types of endings with
> <CR><LF> or just a <LF>
> 
> How do I set this up properly?
> 
> char[] eosCharacters = {'!','?','.'};
> SentenceDetectorFactory sentenceFactory = new SentenceDetectorFactory("de",
> true, null ,eosCharacters);
> 
> eosCharacters is a char array, how to put in your suggested encodings
> '<CR><LF>', '<LF>'?
> 
> How do I have to prepare my final training data set then?
> So I have for example in the text something like (with an artificial line
> break in the middle of the sentence):
> The quick abbr. brown
> fox jumps over the lazy dog
> 
> Training:
> The quick abbr. brown fox jumps over the lazy dog <CR><LF>
> 
> If the standard eos characters {'.','?','!'} are present:
> The quick abbr. brown
> fox jumps over the lazy dog.
> 
> Training:
> The quick abbr. brown fox jumps over the lazy dog.
> 
> If I have an abbreviation at the end of a sentence do I have to encode this
> in a special way?
> The quick abbr. brown
> fox jumps over the lazy dog abbr.
> 
> Training:
> The quick abbr. brown fox jumps over the lazy dog abbr.
> 
> When I have trained my model, do I have to accommodate the input text to
> e.g. <CR><LF> or <LF> inputs as used in the training sentences?
> 
> Thank you for your help!
> 
> lg Markus
