Thank you.

I've also decided to try a simpler rule-based approach, and it performs
quite well.
Anyway, this discussion was very useful to me.

Cheers,
   Riccardo
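
For readers of the archive: a minimal sketch of what a rule-based splitter
for the cases discussed in this thread could look like. The terminator set
(including the colon), the whitespace-around-dash rule, and the function
names are assumptions chosen for illustration. They are not the rules
Riccardo actually used, and none of this is OpenNLP code.

```python
import re

# Hypothetical terminator set. The multi-character "..." is listed first so
# it is matched as one unit rather than as three separate dots.
TERMINATORS = re.compile(r"\.\.\.|[.!?:]")

# A pair of dashes, each surrounded by whitespace, is treated as a
# parenthetical. Hyphenated words such as "rule-based" are not affected.
DASH_PAIR = re.compile(r"\s+-\s+(.+?)\s+-\s+")


def split_dash_parenthetical(segment):
    """Split 'a - b - c' into ['a', '- b -', 'c']; otherwise return as-is."""
    match = DASH_PAIR.search(segment)
    if match is None:
        return [segment]
    before = segment[:match.start()].strip()
    inside = "- " + match.group(1).strip() + " -"
    after = segment[match.end():].strip()
    parts = [before, inside] if before else [inside]
    if after:
        # The remainder may contain further dash pairs.
        parts.extend(split_dash_parenthetical(after))
    return parts


def split_segments(text):
    """Split on terminators first, then on dash parentheticals."""
    pieces = []
    start = 0
    for match in TERMINATORS.finditer(text):
        # Keep the terminator attached to the segment it closes.
        pieces.append(text[start:match.end()])
        start = match.end()
    pieces.append(text[start:])

    segments = []
    for piece in pieces:
        piece = piece.strip()
        if piece:
            segments.extend(split_dash_parenthetical(piece))
    return segments


if __name__ == "__main__":
    for segment in split_segments("some text - speech - other text"):
        print(segment)
```

Rules like these are easy to inspect and adjust, but they carry no notion of
context, so abbreviations and decimal numbers would still need extra
handling.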


2013/3/29 William Colen <[email protected]>

> Looking again at your sample, I believe you won't be able to get good results
> using the standard OpenNLP learnable Sentence Detector, or perhaps any other
> ready-to-use tool. Your segmentation relies on language knowledge that
> is not visible at this level of processing. You may have to combine
> sentence segmentation with POS tagging or clause categorization to get
> good results.
>
> On Tue, Mar 26, 2013 at 10:30 AM, Jörn Kottmann <[email protected]>
> wrote:
>
> > Hello,
> >
> > The sentence detector only considers end-of-sentence (EOS) chars as
> > potential sentence boundaries. It should not be difficult to extend or
> > modify it so that locations detected by user code are used for the split
> > decision.
> >
> > The iterations specify the maximum number of iterations for an iterative
> > machine learning algorithm, and cutoff removes features which did not
> > occur at least n times in the training data.
> >
> > Jörn
> >
> >
> > On 03/26/2013 01:52 PM, Riccardo Tasso wrote:
> >
> >> Thank you Jörn, in fact the results improved a lot:
> >> Precision: 0.5325131810193322
> >> Recall: 0.4745497259201253
> >> F-Measure: 0.5018633540372671
> >>
> >> I guess the splitter could have better results if it were able to detect
> >> parenthetic structure such as:
> >> some text - speech - other text
> >> which in my dataset is split as:
> >> some text
> >> - speech -
> >> other text
> >> Is it possible?
> >>
> >> Another improvement would be to detect end-of-sentence symbols that are
> >> longer than one character, for example "...".
> >>
> >> Can you tell me more about the following parameters?
> >>
> >>     - iterations
> >>     - cutoff
> >>
> >> Is there any guideline on how to tune them?
> >>
> >> Cheers,
> >> Riccardo
> >>
> >>
> >>
> >> 2013/3/26 Jörn Kottmann <[email protected]>
> >>
> >>  On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
> >>>
> >>>> Is the Sentence Detector also able to split on non-dot characters? In my
> >>>> case there should also be other characters delimiting the end of a
> >>>> segment, such as: colon (:), dash (-), various kinds of quotation marks
> >>>> (", `, ', ...).
> >>>>
> >>> The Sentence Detector can only split on end-of-sentence characters. By
> >>> default these are . ! ? but with 1.5.3 you can set them during training
> >>> to your custom set. There is a command line argument for it on the
> >>> Sentence Detector Trainer; have a look at the help.
> >>>
> >>> If you don't want to compile it yourself, use the 1.5.3 RC2, which we
> >>> are currently testing.
> >>>
> >>> Jörn
> >>>
> >>>
> >>>
> >>>
> >
>
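
To illustrate the cutoff parameter described in Jörn's reply above, here is
a minimal sketch of the idea: features that occur fewer than n times in the
training data are dropped before training. This is not OpenNLP's
implementation. The function name and the feature strings are made up for
the example, and the iterations parameter (the maximum number of training
iterations) is not modelled here.

```python
from collections import Counter


def apply_cutoff(training_events, cutoff):
    """Drop features that occur fewer than `cutoff` times in the data.

    `training_events` is a list of feature lists, one per training event.
    """
    # Count every feature occurrence across all training events.
    counts = Counter(
        feature for features in training_events for feature in features
    )
    # Keep only features seen at least `cutoff` times.
    kept = {feature for feature, n in counts.items() if n >= cutoff}
    return [
        [feature for feature in features if feature in kept]
        for features in training_events
    ]


if __name__ == "__main__":
    # Hypothetical feature strings, for illustration only.
    events = [
        ["suffix=.", "next=Cap"],
        ["suffix=.", "next=lower"],
        ["suffix=!", "next=Cap"],
    ]
    for features in apply_cutoff(events, cutoff=2):
        print(features)
```

A higher cutoff discards more rare features, which tends to shrink the model
and can reduce overfitting on small datasets. A lower cutoff keeps rare
features, which may matter when the training set is small and each event is
informative.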
