Looking again at your sample, I believe you won't be able to get good results with the standard OpenNLP learnable Sentence Detector, or perhaps with any other ready-to-use tool. Your segmentation relies on linguistic knowledge that is not available at this level of processing. You may have to combine sentence segmentation with POS tagging or clause categorization to get good results.
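To make Jörn's suggestion quoted below more concrete (letting user code propose the split locations), here is a minimal sketch in plain Java. It does not use the OpenNLP API at all: the class BoundaryCandidates and its methods are hypothetical names invented for illustration. It only shows how candidate positions could be collected for a custom set of end-of-sentence characters, treating a run such as "..." as a single candidate. In a real setup a trained model, possibly with POS or clause features, would accept or reject each candidate rather than splitting at all of them.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: collects candidate
// sentence-boundary positions for a custom set of EOS characters.
class BoundaryCandidates {

    // Returns the index of the last character of every run of EOS
    // characters, so that "..." yields one candidate instead of three.
    static List<Integer> find(String text, String eosChars) {
        List<Integer> candidates = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (eosChars.indexOf(text.charAt(i)) >= 0) {
                int end = i;
                // Extend over adjacent EOS characters ("...", "?!").
                while (end + 1 < text.length()
                        && eosChars.indexOf(text.charAt(end + 1)) >= 0) {
                    end++;
                }
                candidates.add(end);
                i = end + 1;
            } else {
                i++;
            }
        }
        return candidates;
    }

    // Naive split that accepts every candidate. A trained classifier
    // would make this decision per candidate instead.
    static List<String> splitAtAll(String text, String eosChars) {
        List<String> segments = new ArrayList<>();
        int start = 0;
        for (int end : find(text, eosChars)) {
            String segment = text.substring(start, end + 1).trim();
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
            start = end + 1;
        }
        // Keep any trailing text that has no closing EOS character.
        String rest = text.substring(start).trim();
        if (!rest.isEmpty()) {
            segments.add(rest);
        }
        return segments;
    }
}
```

For example, BoundaryCandidates.splitAtAll("He waited... Then he said: go!", ".!?:") gives the three segments "He waited...", "Then he said:" and "go!". Note that adding the dash to the EOS set would break a parenthetic structure like "some text - speech - other text" into three pieces at this stage, which is exactly why I think a later, better-informed decision step is needed.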

On Tue, Mar 26, 2013 at 10:30 AM, Jörn Kottmann <[email protected]> wrote:

> Hello,
>
> the sentence detector only considers EOS chars as potential
> sentence boundaries, it should not be difficult to extend/modify it so
> that locations detected by user code are used for the split decision.
>
> The iterations specify the maximum number of iterations for an iterative
> machine learning algorithm, and cutoff removes features which did not
> occur at least n times in the training data.
>
> Jörn
>
> On 03/26/2013 01:52 PM, Riccardo Tasso wrote:
>
>> Thank you Jörn, in fact the results improved a lot:
>> Precision: 0.5325131810193322
>> Recall: 0.4745497259201253
>> F-Measure: 0.5018633540372671
>>
>> I guess the splitter could have better results if it were able to detect
>> parenthetic structure such as:
>>     some text - speech - other text
>> which in my dataset is split as:
>>     some text
>>     - speech -
>>     other text
>> Is it possible?
>>
>> Another optimization should be the one which could detect symbols to end a
>> sentence longer than one character, for example "...".
>>
>> Can you tell me more about the following parameters?
>>
>> - iterations
>> - cutoff
>>
>> Is there any guideline on how to tune them?
>>
>> Cheers,
>> Riccardo
>>
>> 2013/3/26 Jörn Kottmann <[email protected]>
>>
>>> On 03/26/2013 08:40 AM, Riccardo Tasso wrote:
>>>
>>>> Is the Sentence Detector able to split also on non dot characters? In my
>>>> case there should be also other characters delimiting the end of a
>>>> segment, such as: colon (:), dash (-), various kind of quotation marks
>>>> (", `, ', ...).
>>>
>>> The Sentence Detector can only split on end-of-sentence characters, by
>>> default these are . ! ? but with 1.5.3 you can set them during training
>>> to your custom set, there is a command line argument for it on the
>>> Sentence Detector Trainer, have a look at the help.
>>>
>>> If you don't want to compile yourself use the 1.5.3 RC2 which we are
>>> currently testing.
>>>
>>> Jörn
