Re: custom eos characters

Markus Kreuzthaler Sat, 30 Sep 2017 03:29:08 -0700

Dear Dan and Jörn!

Thank you for your reply!
So I am still trying to find the right training format.


If I understand Jörn correctly, it would be:

char[] eosCharacters = {'!', '?', '.', '\n'};
SentenceDetectorFactory sentenceFactory =
    new SentenceDetectorFactory("de", true, null, eosCharacters);
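
Since '\n' is the only line-break character in this array, text coming from other platforms would presumably have to be normalized before training and before detection. A minimal sketch of that step, using only the Java standard library; the class and method names are hypothetical and not part of OpenNLP:

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of OpenNLP.
final class LineEndings {
    // Matches Windows (\r\n), classic Mac (\r) and Unix (\n) line breaks.
    // "\r\n" comes first so a Windows break is replaced once, not twice.
    private static final Pattern ANY_LINE_BREAK = Pattern.compile("\r\n|\r|\n");

    // Returns the text with every line break rewritten to a single '\n'.
    static String normalize(String text) {
        return ANY_LINE_BREAK.matcher(text).replaceAll("\n");
    }

    public static void main(String[] args) {
        String windowsText = "The quick abbr. brown\r\nfox jumps over the lazy dog\r\n";
        String normalized = normalize(windowsText);
        // After normalization no carriage return is left in the text.
        System.out.println(normalized.indexOf('\r') < 0);
    }
}
```

The same normalization would have to be applied to the training data and to the text passed to the trained model, so that both see the same character.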

From the text (including an artificial line break after "brown"):
The quick abbr. brown
fox jumps over the lazy dog

Training:
A) The quick abbr. brown fox jumps over the lazy dog <NEW_LINE>
Or
B) The quick abbr. brown <NEW_LINE> fox jumps over the lazy dog <NEW_LINE>

What is the right format after the update, A or B?
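
To make the difference between the two candidates concrete, here is a small sketch that produces both from the raw text. It assumes the line endings are already normalized to '\n', and it uses the <NEW_LINE> tag only because Jörn proposed it; whether OpenNLP accepts such a tag is exactly the open question. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of OpenNLP.
final class TrainingFormats {
    // Tag proposed in this thread; an assumption, not an existing OpenNLP convention.
    static final String NEW_LINE_TAG = "<NEW_LINE>";

    // Format A: line breaks inside the sentence are dropped,
    // and only the final one is kept as a tag.
    static String formatA(String sentence) {
        return String.join(" ", nonEmptyLines(sentence)) + " " + NEW_LINE_TAG;
    }

    // Format B: every line break, including the artificial one, becomes a tag.
    static String formatB(String sentence) {
        List<String> tagged = new ArrayList<>();
        for (String line : nonEmptyLines(sentence)) {
            tagged.add(line + " " + NEW_LINE_TAG);
        }
        return String.join(" ", tagged);
    }

    // Splits on '\n', trims each line and skips empty ones.
    private static List<String> nonEmptyLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String raw = "The quick abbr. brown\nfox jumps over the lazy dog\n";
        System.out.println("A) " + formatA(raw));
        System.out.println("B) " + formatB(raw));
    }
}
```

In format A the artificial break after "brown" is invisible to the trainer, while in format B it appears as a tag that does not end a sentence.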

Best regards, Markus


2017-09-29 18:56 GMT+02:00 Dan Russ <danrus...@gmail.com>:

> I am not suggesting we actually change anything.  Only that it is more
> complicated than adding chars to the eos array.
>
> Daniel
>
>
> > On Sep 29, 2017, at 10:44 AM, Joern Kottmann <kottm...@gmail.com> wrote:
> >
> > I think it is a bit unfortunate that we have two tags, <LF> and <CR>. I
> > would change this and normalize them into just one tag, e.g. <NEW_LINE>,
> > and then allow this to be placed in our existing training format as an
> > end-of-sentence marker.
> >
> > The eos array also needs to contain that char; we can just take \n and
> > use this as a marker that we need to detect new line chars independent
> > of the platform.
> >
> > And just to remind us all, we have this problem also in other
> > components, e.g. the name finder can't take new lines into account,
> > but this is obviously needed for certain data sets like a name list
> > where each name is written in one line.
> >
> > Jörn
> >
> > On Fri, Sep 29, 2017 at 4:32 PM, Dan Russ <danrus...@gmail.com> wrote:
> >> Hi Markus,
> >>   Just adding the characters <CR> and <LF> to the eos array is not
> >> going to solve your problem.  You would need to add <CR> and <LF> to your
> >> training set, otherwise the sentence detector will ALWAYS end the sentence
> >> at <CR><LF>.  Think about how the training data looks (including the
> >> example you gave).  I think this would require OpenNLP to change the format
> >> of the sentence detector training data, so we could see <CR> and <LF>, read
> >> the next word and decide whether it is an end of sentence.  You would want
> >> data like:
> >>
> >> Patient admitted at 8:00 AM <LF><CR> <End:Sentence> He complained of
> >> stomach cramps   <LF><CR><End:Sentence>
> >>
> >> In order to catch the end-of-line as a sentence delimiter.
> >>
> >> Do you see a way around it?  Comments?
> >> Daniel
> >>
> >>> On Sep 29, 2017, at 9:52 AM, Markus Kreuzthaler <markus.kreuztha...@gmail.com> wrote:
> >>>
> >>> Hello!
> >>>
> >>> I state my problem again as I think it is quite similar to the following
> >>> issue:
> >>> https://issues.apache.org/jira/browse/OPENNLP-602
> >>>
> >>> I work with clinical narratives so eos characters are very often just
> >>> missing, and I try to train a new robust sentence model.
> >>> From the issue above it is suggested to encode these types of endings with
> >>> <CR><LF> or just a <LF>
> >>>
> >>> How do I set this up properly?
> >>>
> >>> char[] eosCharacters = {'!','?','.'};
> >>> SentenceDetectorFactory sentenceFactory =
> >>>     new SentenceDetectorFactory("de", true, null, eosCharacters);
> >>>
> >>> eosCharacters is a char array, how to put in your suggested encodings
> >>> '<CR><LF>', '<LF>'?
> >>>
> >>> How do I have to prepare my final training data set then?
> >>> So I have for example in the text something like (with an artificial line
> >>> break in the middle of the sentence):
> >>> The quick abbr. brown
> >>> fox jumps over the lazy dog
> >>>
> >>> Training:
> >>> The quick abbr. brown fox jumps over the lazy dog <CR><LF>
> >>>
> >>> If the standard eos characters {'.','?','!'} are present:
> >>> The quick abbr. brown
> >>> fox jumps over the lazy dog.
> >>>
> >>> Training:
> >>> The quick abbr. brown fox jumps over the lazy dog.
> >>>
> >>> If I have an abbreviation at the end of a sentence do I have to encode this
> >>> in a special way?
> >>> The quick abbr. brown
> >>> fox jumps over the lazy dog abbr.
> >>>
> >>> Training:
> >>> The quick abbr. brown fox jumps over the lazy dog abbr.
> >>>
> >>> When I have trained my model, do I have to adapt the input text to
> >>> e.g. <CR><LF> or <LF> inputs as used in the training sentences?
> >>>
> >>> Thank you for your help!
> >>>
> >>> Best regards, Markus
> >>
>
>
