Dear Dan and Jörn,

Thank you for your replies! I am still trying to find the right training format.

If I understand Jörn correctly, the setup would be:

char[] eosCharacters = {'!','?','.','\n'};
SentenceDetectorFactory sentenceFactory = new SentenceDetectorFactory("de", true, null, eosCharacters);

From the text (including an artificial line break after "brown"):

The quick abbr. brown
fox jumps over the lazy dog

Training:

A)
The quick abbr. brown fox jumps over the lazy dog <NEW_LINE>

or B)
The quick abbr. brown <NEW_LINE> fox jumps over the lazy dog <NEW_LINE>

Which is the right format after the update, A or B?

Kind regards,
Markus

2017-09-29 18:56 GMT+02:00 Dan Russ <danrus...@gmail.com>:

> I am not suggesting we actually change anything. Only that it is more
> complicated than adding chars to the eos array.
>
> Daniel
>
> > On Sep 29, 2017, at 10:44 AM, Joern Kottmann <kottm...@gmail.com> wrote:
> >
> > I think it is a bit unfortunate that we have two tags, <LF> and <CR>. I
> > would change this and normalize them into just one tag, e.g. <NEW_LINE>,
> > and then allow it to be placed in our existing training format as an
> > end-of-sentence marker.
> >
> > The eos array also needs to contain that char; we can just take \n and
> > use it as a marker that we need to detect new line chars independent
> > of the platform.
> >
> > And just to remind us all, we have this problem in other
> > components too, e.g. the name finder can't take new lines into account,
> > but this is obviously needed for certain data sets like a name list
> > where each name is written on one line.
> >
> > Jörn
> >
> > On Fri, Sep 29, 2017 at 4:32 PM, Dan Russ <danrus...@gmail.com> wrote:
> >> Hi Markus,
> >> Just adding the characters <CR> and <LF> to the eos array is not
> >> going to solve your problem. You would need to add <CR> and <LF> to your
> >> training set, otherwise the sentence detector will ALWAYS end the sentence
> >> at <CR><LF>. Think about how the training data looks (including the example
> >> you gave).
> >> I think this would require OpenNLP to change the format of the
> >> sentence detector training data, so that we could see <CR> and <LF>, read
> >> the next word, and decide whether it is an end of sentence. You would want
> >> data like:
> >>
> >> Patient admitted at 8:00 AM <LF><CR> <End:Sentence> He complained of stomach cramps <LF><CR><End:Sentence>
> >>
> >> In order to catch the end-of-line as a sentence delimiter.
> >>
> >> Do you see a way around it? Comments?
> >> Daniel
> >>
> >>> On Sep 29, 2017, at 9:52 AM, Markus Kreuzthaler <markus.kreuztha...@gmail.com> wrote:
> >>>
> >>> Hello!
> >>>
> >>> I am stating my problem again, as I think it is quite similar to the
> >>> following issue:
> >>> https://issues.apache.org/jira/browse/OPENNLP-602
> >>>
> >>> I work with clinical narratives, so eos characters are very often just
> >>> missing, and I am trying to train a new, robust sentence model.
> >>> The issue above suggests encoding these types of endings with
> >>> <CR><LF> or just <LF>.
> >>>
> >>> How do I set this up properly?
> >>>
> >>> char[] eosCharacters = {'!','?','.'};
> >>> SentenceDetectorFactory sentenceFactory = new SentenceDetectorFactory("de", true, null, eosCharacters);
> >>>
> >>> eosCharacters is a char array, so how do I put in your suggested
> >>> encodings '<CR><LF>' and '<LF>'?
> >>>
> >>> How do I then have to prepare my final training data set?
> >>> For example, I have something like this in the text (with an artificial
> >>> line break in the middle of the sentence):
> >>> The quick abbr. brown
> >>> fox jumps over the lazy dog
> >>>
> >>> Training:
> >>> The quick abbr. brown fox jumps over the lazy dog <CR><LF>
> >>>
> >>> If the standard eos characters {'.','?','!'} are present:
> >>> The quick abbr. brown
> >>> fox jumps over the lazy dog.
> >>>
> >>> Training:
> >>> The quick abbr. brown fox jumps over the lazy dog.
> >>>
> >>> If I have an abbreviation at the end of a sentence, do I have to encode
> >>> this in a special way?
> >>> The quick abbr. brown
> >>> fox jumps over the lazy dog abbr.
> >>>
> >>> Training:
> >>> The quick abbr. brown fox jumps over the lazy dog abbr.
> >>>
> >>> Once I have trained my model, do I have to adapt the input text,
> >>> e.g. to the <CR><LF> or <LF> markers used in the training sentences?
> >>>
> >>> Thank you for your help!
> >>>
> >>> Kind regards,
> >>> Markus
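
PS: To make the idea of a single, platform-independent new-line marker concrete, here is a minimal sketch of how the raw text could be pre-processed so that '\n' is the only line-break character that ever reaches the eos array. The class NewLineNormalizer and its methods are hypothetical helpers of my own, not part of OpenNLP, and I am only assuming that the same normalization would be applied to both the training text and the input text at detection time.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, NOT part of OpenNLP: maps all line-break styles to '\n'. */
final class NewLineNormalizer {

    // "\r\n" is listed first so that a Windows line break is consumed as ONE break,
    // not as a '\r' followed by a separate '\n'.
    private static final Pattern LINE_BREAK = Pattern.compile("\r\n|\r|\n");

    private NewLineNormalizer() {
    }

    /** Replaces every CR+LF, lone CR, or lone LF with a single '\n'. */
    static String normalize(String text) {
        return LINE_BREAK.matcher(text).replaceAll("\n");
    }

    /** Counts the characters in the text that occur in the given eos array. */
    static int countEosCandidates(String text, char[] eosCharacters) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            for (char eos : eosCharacters) {
                if (text.charAt(i) == eos) {
                    count++;
                    break; // count each position at most once
                }
            }
        }
        return count;
    }
}
```

With eosCharacters = {'!','?','.','\n'}, the normalized example text then contains the '.' after "abbr" and every line break as candidate positions, and the trained model would have to decide for each of them whether it really ends a sentence.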