Sentences are a lot more difficult to split than you might first think.
Statistical models seem like overkill at first, but there are so many
instances where a rule-based system fails (abbreviations like "Dr.",
decimal numbers, ellipses, quoted speech) that eventually you will be
led to the more complex models. Of course, I am speaking in general
terms, and your case may be different depending on your domain. It is
possible that a simple rule-based sentence splitter would work just
fine. So go ahead and write a rule-based sentence splitter (or google
one up).
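
To make that concrete, here is a minimal sketch of the rule-based
approach in plain Java. The abbreviation list and the
"uppercase-follows" heuristic are purely illustrative, not a complete
recipe:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class RuleBasedSentenceSplitter {

        // Illustrative only; a real list would be much longer and
        // domain-specific.
        private static final Set<String> ABBREVIATIONS = new HashSet<String>(
                Arrays.asList("Dr.", "Mr.", "Mrs.", "Ms.", "Prof.",
                              "etc.", "e.g.", "i.e."));

        public static List<String> split(String text) {
            List<String> sentences = new ArrayList<String>();
            int start = 0;
            for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (c != '.' && c != '!' && c != '?') {
                    continue;
                }
                // Recover the token ending here so we can rule out
                // abbreviations ("Dr." should not end a sentence).
                int tokenStart = i;
                while (tokenStart > start
                        && !Character.isWhitespace(text.charAt(tokenStart - 1))) {
                    tokenStart--;
                }
                String token = text.substring(tokenStart, i + 1);
                // Only split if the next non-space character looks like a
                // sentence opener (uppercase letter or end of input). This
                // also keeps "3.14" together, since '1' is not uppercase.
                int j = i + 1;
                while (j < text.length() && Character.isWhitespace(text.charAt(j))) {
                    j++;
                }
                boolean opensSentence = j >= text.length()
                        || Character.isUpperCase(text.charAt(j));
                if (opensSentence && !ABBREVIATIONS.contains(token)) {
                    sentences.add(text.substring(start, i + 1).trim());
                    start = i + 1;
                }
            }
            if (start < text.length()) {
                String tail = text.substring(start).trim();
                if (tail.length() > 0) {
                    sentences.add(tail);
                }
            }
            return sentences;
        }
    }

Even this already mishandles things like "U.S. officials said..." (it
splits after "U.S." because the next word is capitalized), which is
exactly the slippery slope I mean.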
For tokenization, you may run into similar issues. That said, if you
are looking for a reasonable rule-based tokenizer, I once ran across a
reference to JTokenizer; I don't have a URL handy, but I think it is on
Google Code.
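
I can't vouch for JTokenizer's API, but a bare-bones regex tokenizer is
easy to sketch, and its output can be fed straight into the OpenNLP
annotators you mention. This assumes the OpenNLP 1.5-style API, and the
model file name ("en-pos-maxent.bin") is just a placeholder for
whatever model you have downloaded:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    public class RuleBasedTokenizer {

        // Words (allowing internal apostrophes/hyphens), or any single
        // non-space character. Naive: URLs, dates, etc. will come out wrong.
        private static final Pattern TOKEN =
                Pattern.compile("\\w+(?:['-]\\w+)*|\\S");

        public static String[] tokenize(String sentence) {
            List<String> tokens = new ArrayList<String>();
            Matcher m = TOKEN.matcher(sentence);
            while (m.find()) {
                tokens.add(m.group());
            }
            return tokens.toArray(new String[tokens.size()]);
        }

        public static void main(String[] args) throws Exception {
            String[] tokens =
                    tokenize("The quick brown fox can't jump over Dr. Smith.");
            // Hand the rule-based tokens to OpenNLP's statistical POS
            // tagger (1.5-style API; the model path is an assumption).
            InputStream in = new FileInputStream("en-pos-maxent.bin");
            POSModel model = new POSModel(in);
            in.close();
            String[] tags = new POSTaggerME(model).tag(tokens);
            for (int i = 0; i < tokens.length; i++) {
                System.out.println(tokens[i] + "/" + tags[i]);
            }
        }
    }

The point being: nothing stops you from mixing rule-based segmentation
with the model-based taggers downstream, as long as your tokens roughly
match what the models were trained on.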
Dave
jonathan doklovic wrote:
Hi,
I've been playing around with the OpenNLP wrappers and will probably
make use of the entity detection, but I was wondering about the sentence
and token detection.
It seems that a statistical, model-based approach may be overkill and
more of a pain to correct errors in.
I was wondering if there's any reason not to use a rule-based
sentence/token detector that then feeds the OpenNLP POS and entity
model-based annotators?
Any thoughts are welcome.
- Jonathan