Hi, I'm trying to use the OpenNLP SentenceDetector to split Italian sentences
(without abbreviations) that represent direct speech.

I have a fairly large dataset annotated by human experts, in which each
document is a line of text segmented into one or more pieces according to
our needs.

To illustrate my case, suppose the line is the following:
I'm not able to play tennis - he said - You're right - replied his wife

The right segmentation should be:
I'm not able to play tennis
 - he said -
You're right
 - replied his wife

I decided to try a statistical approach to segmenting my text, and
SentenceDetector seems like the right choice to me.

I've built the training set in the format specified at
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect.training
which is:

   - one segment per line
   - a blank line to separate two documents
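For example, the document above would appear in the training file as these
four lines, followed by a blank line before the next document (I'm assuming
leading whitespace in a segment is not significant):

    I'm not able to play tennis
    - he said -
    You're right
    - replied his wife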

To evaluate performance I've split my dataset into a training set and a
validation set, but the results were quite low:
Precision: 0.4485549132947977
Recall: 0.3038371182458888
F-Measure: 0.3622782446311859
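For reference, this is roughly the code I'm using, a minimal sketch following
the training API section of the manual linked above (if I've read the Javadoc
correctly; train.txt and eval.txt are placeholder file names):

    import java.io.FileInputStream;
    import java.nio.charset.Charset;

    import opennlp.tools.sentdetect.SentenceDetectorEvaluator;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class SentDetectEval {
      public static void main(String[] args) throws Exception {
        Charset charset = Charset.forName("UTF-8");

        // Training data: one segment per line, blank line between documents.
        ObjectStream<SentenceSample> trainSamples = new SentenceSampleStream(
            new PlainTextByLineStream(new FileInputStream("train.txt"), charset));

        // Train with the default parameters.
        SentenceModel model =
            SentenceDetectorME.train("it", trainSamples, true, null);
        trainSamples.close();

        // Evaluate on the held-out validation documents.
        ObjectStream<SentenceSample> evalSamples = new SentenceSampleStream(
            new PlainTextByLineStream(new FileInputStream("eval.txt"), charset));
        SentenceDetectorEvaluator evaluator =
            new SentenceDetectorEvaluator(new SentenceDetectorME(model));
        evaluator.evaluate(evalSamples);
        evalSamples.close();

        // FMeasure.toString() prints Precision / Recall / F-Measure.
        System.out.println(evaluator.getFMeasure());
      }
    }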

Since I've used the default training parameters, I guess there should be some
way to obtain better results... or do I need a different kind of model?
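For instance, if I understand the API correctly, the feature cutoff and the
number of iterations can be passed explicitly instead of using the defaults
(the values below are just guesses on my part):

    // Assumed overload:
    //   train(lang, samples, useTokenEnd, abbreviations, cutoff, iterations)
    // The defaults should be cutoff=5 and iterations=100; a lower cutoff
    // keeps rarer features, more iterations trains the maxent model longer.
    SentenceModel model =
        SentenceDetectorME.train("it", trainSamples, true, null, 1, 300);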

Thanks,
   Riccardo
