Sentences are a lot more difficult to split than you might first think.
Statistical models seem like overkill at first, but there are so many
instances where a rule-based system fails (abbreviations like "Dr.",
decimal numbers, ellipses, quoted speech) that eventually you will be
led to the more complex models. Of course, I am speaking in general
terms, and your case may be different depending on your domain. It is
possible that a simple rule-based sentence splitter would work just
fine. So go ahead and write a rule-based sentence splitter (or google
one up).
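
To make that concrete, here is a minimal sketch of the rule-based
approach in plain Java. The abbreviation list and the
"uppercase-follows" heuristic are purely illustrative, not a complete
recipe:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class RuleBasedSentenceSplitter {

        // Illustrative only; a real list would be much longer and
        // domain-specific.
        private static final Set<String> ABBREVIATIONS = new HashSet<String>(
                Arrays.asList("Dr.", "Mr.", "Mrs.", "Ms.", "Prof.",
                              "etc.", "e.g.", "i.e."));

        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<String>();
            int start = 0;
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c != '.' && c != '!' && c != '?') {
                    continue;
                }
                // Recover the token ending here so we can rule out
                // abbreviations ("Dr." should not end a sentence).
                int tokenStart = i;
                while (tokenStart > start
                        && !Character.isWhitespace(text.charAt(tokenStart - 1))) {
                    tokenStart--;
                }
                String token = text.substring(tokenStart, i + 1);
                // Only split if the next non-space character looks like a
                // sentence opener (uppercase letter or end of input). This
                // also keeps "3.14" together, since '1' is not uppercase.
                int j = i + 1;
                while (j < text.length() && Character.isWhitespace(text.charAt(j))) {
                    j++;
                }
                boolean opensSentence = j >= text.length()
                        || Character.isUpperCase(text.charAt(j));
                if (opensSentence && !ABBREVIATIONS.contains(token)) {
                    sentences.add(text.substring(start, i + 1).trim());
                    start = i + 1;
                }
            }
            if (start < text.length()) {
                String tail = text.substring(start).trim();
                if (tail.length() > 0) {
                    sentences.add(tail);
                }
            }
            return sentences;
        }
    }

Even this already mishandles things like "U.S. officials said..." (it
splits after "U.S." because the next word is capitalized), which is
exactly the slippery slope I mean.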
For tokenization, you may run into similar issues. That said, if you
are looking for a reasonable rule-based tokenizer, I once ran across a
reference to JTokenizer; I don't have a URL handy, but I think it is on
Google Code.
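
I can't vouch for JTokenizer's API, but a bare-bones regex tokenizer is
easy to sketch, and its output can be fed straight into the OpenNLP
annotators you mention. This assumes the OpenNLP 1.5-style API, and the
model file name ("en-pos-maxent.bin") is just a placeholder for
whatever model you have downloaded:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class RuleBasedTokenizer {

        // Words (allowing internal apostrophes/hyphens), or any single
        // non-space character. Naive: URLs, dates, etc. will come out wrong.
        private static final Pattern TOKEN =
                Pattern.compile("\\w+(?:['-]\\w+)*|\\S");

        public static String[] tokenize(String sentence) {
            List<String> tokens = new ArrayList<String>();
            Matcher m = TOKEN.matcher(sentence);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens.toArray(new String[tokens.size()]);
        }

        public static void main(String[] args) throws Exception {
            String[] tokens =
                    tokenize("The quick brown fox can't jump over Dr. Smith.");
            // Hand the rule-based tokens to OpenNLP's statistical POS
            // tagger (1.5-style API; the model path is an assumption).
            InputStream in = new FileInputStream("en-pos-maxent.bin");
            POSModel model = new POSModel(in);
            in.close();
            String[] tags = new POSTaggerME(model).tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }

The point being: nothing stops you from mixing rule-based segmentation
with the model-based taggers downstream, as long as your tokens roughly
match what the models were trained on.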
Dave
jonathan doklovic wrote:
Hi,
I've been playing around with the OpenNLP wrappers and will probably
make use of the entity detection, but I was wondering about the sentence
and token detection.
It seems that a statistical, model-based approach may be overkill and
more of a pain to correct errors in.
I was wondering if there's any reason not to use a rule-based
sentence/token detector that then feeds the OpenNLP POS and entity
model-based annotators?
Any thoughts are welcome.
- Jonathan