Sentences are a lot more difficult to split than you would first think. Statistical models seem like overkill at first, but there are so many cases where a rule-based system fails (abbreviations like "Dr." or "U.S.", decimal numbers, ellipses) that eventually you will be led to the more complex models. Of course, I am speaking in general terms, and your case may be different depending on your domain; it is possible that a simple rule-based sentence splitter would work just fine. So go ahead and write one (or google one up). For tokenization you may run into similar issues. If you are looking for a reasonable rule-based tokenizer, I once ran across a reference to JTokenizer, though I don't have a URL handy. I think it is on Google Code.
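
For what it's worth, here is a rough sketch of the kind of rule-based splitter I mean, in plain Java with no dependencies. The class name and the abbreviation list are just illustrations; a real splitter would need a domain-tuned abbreviation set, and this is exactly the sort of thing that breaks down as the edge cases pile up:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SimpleSentenceSplitter {

    // Illustrative only; a real splitter needs a domain-tuned abbreviation set.
    private static final Set<String> ABBREVIATIONS = new HashSet<>(
            Arrays.asList("dr.", "mr.", "mrs.", "ms.", "u.s.", "etc.", "e.g.", "i.e."));

    public static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                // Look ahead: only break on whitespace followed by an uppercase letter.
                int j = i + 1;
                while (j < text.length() && Character.isWhitespace(text.charAt(j))) {
                    j++;
                }
                boolean nextLooksLikeSentence =
                        j == text.length() || Character.isUpperCase(text.charAt(j));
                // Don't break after a known abbreviation or inside a decimal number.
                boolean isAbbrev = ABBREVIATIONS.contains(lastToken(text, i).toLowerCase());
                boolean isDecimal =
                        i + 1 < text.length() && Character.isDigit(text.charAt(i + 1));
                if (nextLooksLikeSentence && !isAbbrev && !isDecimal && j > i + 1) {
                    sentences.add(text.substring(start, i + 1).trim());
                    start = j;
                    i = j - 1;
                }
            }
        }
        // Flush whatever remains as the final sentence.
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    // Returns the whitespace-delimited token ending at index end, e.g. "Dr." in "... Dr."
    private static String lastToken(String text, int end) {
        int k = end;
        while (k > 0 && !Character.isWhitespace(text.charAt(k - 1))) {
            k--;
        }
        return text.substring(k, end + 1);
    }
}

If that holds up on your data, you could wrap it in a sentence annotator and let the OpenNLP POS and entity annotators consume its output, as you describe.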

Dave

jonathan doklovic wrote:
Hi,

I've been playing around with the opennlp wrappers and will probably
make use of the entity detection, but I was wondering about the sentence
and token detection.

It seems that a statistical model-based approach may be overkill and
more of a pain to correct errors in.

I was wondering if there's any reason not to use a rule-based
sentence/token detector that then feeds the OpenNLP POS and entity
model-based annotators?

Any thoughts are welcome.

- Jonathan
