On 3/25/2013 11:31 AM, Riccardo Tasso wrote:
Hi, I'm trying to use the OpenNLP SentenceDetector to split Italian sentences
(without abbreviations) that represent direct speech.
I have a fairly large dataset, annotated by human experts, in which each
document is a line of text segmented into one or more pieces according to
our needs.
To illustrate my case, suppose the input line is the following:
I'm not able to play tennis - he said - You're right - replied his wife
The right segmentation should be:
I'm not able to play tennis
- he said -
You're right
- replied his wife
I decided to try a statistical approach to segmenting my text, and the
SentenceDetector seems like the right choice.
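At detection time I'm calling it more or less like this (the model path is
just a placeholder; the rest is the standard 1.5.x API):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;

    public class DetectSegments {
        public static void main(String[] args) throws Exception {
            // Load the trained model (file name is a placeholder).
            InputStream modelIn = new FileInputStream("it-speech-sent.bin");
            SentenceModel model = new SentenceModel(modelIn);
            modelIn.close();

            SentenceDetectorME detector = new SentenceDetectorME(model);

            // Each element of the returned array should be one segment.
            String[] segments = detector.sentDetect(
                "I'm not able to play tennis - he said - You're right - replied his wife");
            for (String s : segments) {
                System.out.println(s);
            }
        }
    }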
I've built the training set in the format specified in
http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html#tools.sentdetect.training
which is:
- one segment per line
- a blank line to separate two documents
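Concretely, I train through the API essentially as the manual shows; aside
from the "it" language code and the file name (a placeholder), this mirrors
the documented snippet:

    import java.io.FileInputStream;
    import java.nio.charset.Charset;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainSegmenter {
        public static void main(String[] args) throws Exception {
            // Training file: one segment per line, blank line between documents.
            ObjectStream<String> lineStream = new PlainTextByLineStream(
                new FileInputStream("it-speech.train"), Charset.forName("UTF-8"));
            ObjectStream<SentenceSample> sampleStream =
                new SentenceSampleStream(lineStream);

            SentenceModel model;
            try {
                // "it" = language code, true = use token end,
                // null = no abbreviation dictionary, default training parameters.
                model = SentenceDetectorME.train("it", sampleStream, true, null,
                    TrainingParameters.defaultParams());
            } finally {
                sampleStream.close();
            }
            // The model can then be persisted with model.serialize(...).
        }
    }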
To evaluate performance I split my dataset into a training set and a
validation set, but the results were quite low:
Precision: 0.4485549132947977
Recall: 0.3038371182458888
F-Measure: 0.3622782446311859
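For completeness, those figures come from an evaluation along these lines
(file names are placeholders):

    import java.io.FileInputStream;
    import java.nio.charset.Charset;
    import opennlp.tools.sentdetect.SentenceDetectorEvaluator;
    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.sentdetect.SentenceSample;
    import opennlp.tools.sentdetect.SentenceSampleStream;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;

    public class EvalSegmenter {
        public static void main(String[] args) throws Exception {
            SentenceModel model =
                new SentenceModel(new FileInputStream("it-speech-sent.bin"));
            SentenceDetectorME detector = new SentenceDetectorME(model);

            // Held-out validation data, same one-segment-per-line format.
            ObjectStream<SentenceSample> samples = new SentenceSampleStream(
                new PlainTextByLineStream(
                    new FileInputStream("it-speech.eval"),
                    Charset.forName("UTF-8")));

            SentenceDetectorEvaluator evaluator =
                new SentenceDetectorEvaluator(detector);
            evaluator.evaluate(samples);

            // FMeasure prints precision, recall and F-measure as above.
            System.out.println(evaluator.getFMeasure());
        }
    }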
Since I've used the default values, I guess there should be some way to
obtain better results... or maybe I need a different kind of model?
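For what it's worth, the default maxent trainer does expose a feature cutoff
and an iteration count, so I could replace TrainingParameters.defaultParams()
in the training code above with something like this (values are examples only):

    TrainingParameters params = TrainingParameters.defaultParams();
    // Example values only: lower the feature cutoff, raise the iterations.
    params.put(TrainingParameters.CUTOFF_PARAM, Integer.toString(0));
    params.put(TrainingParameters.ITERATIONS_PARAM, Integer.toString(300));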
Thanks,
Riccardo
Riccardo,
How many sentences and documents are in your training set?
James