Re: custom eos characters

Joern Kottmann Fri, 29 Sep 2017 07:45:42 -0700

I think it is unfortunate that we have two tags, <LF> and <CR>. I
would change this and normalize them into a single tag, e.g. <NEW_LINE>,
and then allow it to be placed in our existing training format as an
end-of-sentence marker.
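
A minimal sketch of that normalization step, assuming the training data
already contains literal <CR> and <LF> tags. The class name, the method
name, and the <NEW_LINE> tag itself are hypothetical; nothing here is
existing OpenNLP API.

```java
import java.util.regex.Pattern;

public class NewlineTagNormalizer {
    // Hypothetical single tag, following the <NEW_LINE> proposal above.
    static final String NEW_LINE_TAG = "<NEW_LINE>";

    // Matches <CR><LF>, <LF><CR>, a lone <CR>, or a lone <LF>.
    // The two-tag alternatives come first so a pair becomes one tag.
    private static final Pattern LINE_BREAK_TAGS =
        Pattern.compile("<CR>\\s*<LF>|<LF>\\s*<CR>|<CR>|<LF>");

    // Rewrites every line-break tag in one training line to NEW_LINE_TAG.
    static String normalizeTags(String trainingLine) {
        return LINE_BREAK_TAGS.matcher(trainingLine).replaceAll(NEW_LINE_TAG);
    }

    public static void main(String[] args) {
        System.out.println(normalizeTags("He complained of stomach cramps <CR><LF>"));
    }
}
```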


The eos array also needs to contain that char. We can just take \n and
use it as a marker that we need to detect new line chars independent
of the platform.
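
A rough sketch of the platform-independent part, assuming the input is
normalized before it reaches the sentence detector. The class and method
names are hypothetical. The char array has the same shape as the
eosCharacters array in Markus's snippet below; whether the detector
treats '\n' correctly as an eos candidate is not confirmed here.

```java
public class LineBreakNormalizer {
    // End-of-sentence candidates: the usual punctuation plus '\n'.
    // Assumption: this array would be handed to SentenceDetectorFactory
    // in place of {'!', '?', '.'}.
    static final char[] EOS_CHARACTERS = {'.', '!', '?', '\n'};

    // Collapses Windows (\r\n) and lone \r line breaks into '\n'.
    // The \r\n pair is replaced first so it yields a single '\n'.
    static String normalizeLineBreaks(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String raw = "Patient admitted at 8:00 AM\r\nHe complained of stomach cramps\r";
        // Count the '\n' characters left after normalization.
        long breaks = normalizeLineBreaks(raw).chars().filter(c -> c == '\n').count();
        System.out.println(breaks);
    }
}
```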

And just to remind us all, we also have this problem in other
components, e.g. the name finder can't take new lines into account,
but this is obviously needed for certain data sets, like a name list
where each name is written on its own line.

Jörn

On Fri, Sep 29, 2017 at 4:32 PM, Dan Russ <danrus...@gmail.com> wrote:
> Hi Markus,
>    Just adding the characters <CR> and <LF> to the eos array is not going to
> solve your problem.  You would need to add <CR> and <LF> to your training set,
> otherwise the sentence detector will ALWAYS end the sentence at <CR><LF>.
> Think about how the training data looks (including the example you gave).  I
> think this would require OpenNLP to change the format of the sentence detector
> training data, so we could see <CR> and <LF>, read the next word, and decide
> whether it is an end of sentence.  You would want data like:
>
> Patient admitted at 8:00 AM <LF><CR> <End:Sentence> He complained of stomach 
> cramps   <LF><CR><End:Sentence>
>
> In order to catch the end-of-line as a sentence delimiter.
>
> Do you see a way around it?  Comments?
> Daniel
>
>> On Sep 29, 2017, at 9:52 AM, Markus Kreuzthaler 
>> <markus.kreuztha...@gmail.com> wrote:
>>
>> Hello!
>>
>> I state my problem again as I think it is quite similar to the following
>> issue:
>> https://issues.apache.org/jira/browse/OPENNLP-602
>>
>> I work with clinical narratives so eos characters are very often just
>> missing, and I try to train a new robust sentence model.
>> The issue above suggests encoding these types of endings with
>> <CR><LF> or just a <LF>.
>>
>> How do I set this up properly?
>>
>> char[] eosCharacters = {'!','?','.'};
>> SentenceDetectorFactory sentenceFactory = new SentenceDetectorFactory("de",
>> true, null ,eosCharacters);
>>
>> eosCharacters is a char array, so how do I put in the suggested encodings
>> '<CR><LF>' and '<LF>'?
>>
>> How do I have to prepare my final training data set then?
>> So I have for example in the text something like (with an artificial line
>> break in the middle of the sentence):
>> The quick abbr. brown
>> fox jumps over the lazy dog
>>
>> Training:
>> The quick abbr. brown fox jumps over the lazy dog <CR><LF>
>>
>> If the standard eos characters {'.','?','!'} are present:
>> The quick abbr. brown
>> fox jumps over the lazy dog.
>>
>> Training:
>> The quick abbr. brown fox jumps over the lazy dog.
>>
>> If I have an abbreviation at the end of a sentence do I have to encode this
>> in a special way?
>> The quick abbr. brown
>> fox jumps over the lazy dog abbr.
>>
>> Training:
>> The quick abbr. brown fox jumps over the lazy dog abbr.
>>
>> Once I have trained my model, do I have to adapt the input text to use
>> e.g. <CR><LF> or <LF> as in the training sentences?
>>
>> Thank you for your help!
>>
>> lg Markus
>
