On 3/25/2013 11:31 AM, Riccardo Tasso wrote:
Hi, I'm trying to use the OpenNLP SentenceDetector to split Italian sentences
(without abbreviations) that represent direct speech.
I have a fairly large dataset, annotated by human experts, in which each
document is a line of text segmented into one or more pieces according to
our needs.
To illustrate my case, suppose the input line is the following:
I'm not able to play tennis - he said - You're right - replied his wife
The right segmentation should be:
I'm not able to play tennis
- he said -
You're right
- replied his wife
I decided to try a statistical approach to segmenting my text, and the
SentenceDetector seems like the right choice.
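At detection time I'm calling it more or less like this (the model path is
just a placeholder; the rest is the standard 1.5.x API):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class DetectSegments {
        public static void main(String[] args) throws Exception {
            // Load the trained model (file name is a placeholder).
            InputStream modelIn = new FileInputStream("it-speech-sent.bin");
            SentenceModel model = new SentenceModel(modelIn);
            modelIn.close();

            SentenceDetectorME detector = new SentenceDetectorME(model);

            // Each element of the returned array should be one segment.
            String[] segments = detector.sentDetect(
                "I'm not able to play tennis - he said - You're right - replied his wife");
            for (String s : segments) {
                System.out.println(s);
            }
        }
    }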
I've built the training set in the format specified in
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect.training
which is:
- one segment per line
- a blank line to separate two documents
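Concretely, I train through the API essentially as the manual shows; aside
from the "it" language code and the file name (a placeholder), this mirrors
the documented snippet:

    import java.io.FileInputStream;
    import java.nio.charset.Charset;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSegmenter {
        public static void main(String[] args) throws Exception {
            // Training file: one segment per line, blank line between documents.
            ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream("it-speech.train"), Charset.forName("UTF-8"));
            ObjectStream<SentenceSample> sampleStream =
                new SentenceSampleStream(lineStream);

            SentenceModel model;
            try {
                // "it" = language code, true = use token end,
                // null = no abbreviation dictionary, default training parameters.
                model = SentenceDetectorME.train("it", sampleStream, true, null,
                    TrainingParameters.defaultParams());
            } finally {
                sampleStream.close();
            }
            // The model can then be persisted with model.serialize(...).
        }
    }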
To evaluate performance I split my dataset into a training set and a
validation set, but the results were quite low:
Precision: 0.4485549132947977
Recall: 0.3038371182458888
F-Measure: 0.3622782446311859
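For completeness, those figures come from an evaluation along these lines
(file names are placeholders):

    import java.io.FileInputStream;
    import java.nio.charset.Charset;
    import opennlp.tools.sentdetect.SentenceDetectorEvaluator;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class EvalSegmenter {
        public static void main(String[] args) throws Exception {
            SentenceModel model =
                new SentenceModel(new FileInputStream("it-speech-sent.bin"));
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // Held-out validation data, same one-segment-per-line format.
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(
                new PlainTextByLineStream(
                    new FileInputStream("it-speech.eval"),
                    Charset.forName("UTF-8")));

            SentenceDetectorEvaluator evaluator =
                new SentenceDetectorEvaluator(detector);
            evaluator.evaluate(samples);

            // FMeasure prints precision, recall and F-measure as above.
            System.out.println(evaluator.getFMeasure());
        }
    }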
Since I've used the default values, I guess there should be some way to
obtain better results... or maybe I need a different kind of model?
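For what it's worth, the default maxent trainer does expose a feature cutoff
and an iteration count, so I could replace TrainingParameters.defaultParams()
in the training code above with something like this (values are examples only):

    TrainingParameters params = TrainingParameters.defaultParams();
    // Example values only: lower the feature cutoff, raise the iterations.
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
    params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(300));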
Thanks,
Riccardo
Riccardo,
How many sentences and documents are in your training set?
James