You might try the sentence boundary detector from the International Components for Unicode project:
http://icu-project.org/userguide/boundaryAnalysis.html

This implements the rules from Unicode Standard Annex #29 (expressed as regular expressions), and it also detects boundaries for characters, words, and lines. I haven't tried it myself, so I don't know how well it works. However, the documentation does say that these are relatively simple rules and that some applications may require more sophisticated linguistic analysis. On the other hand, the rules cover many languages.

Greg Holmberg

-------------- Original message ----------------------
From: jonathan doklovic <[EMAIL PROTECTED]>
> Hi,
>
> I've been playing around with the OpenNLP wrappers and will probably
> make use of the entity detection, but I was wondering about the sentence
> and token detection.
>
> It seems that a model-based (statistical) approach may be overkill and
> more of a pain to correct errors in.
>
> I was wondering if there's any reason not to use a rule-based
> sentence/token detector that then feeds the OpenNLP POS and entity
> model-based annotators?
>
> Any thoughts are welcome.
>
> - Jonathan
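For what it's worth, a minimal sketch of this kind of rule-based sentence splitting, using the JDK's built-in java.text.BreakIterator (whose sentence rules follow the same UAX #29 style of boundary analysis that ICU implements; the class name SentenceSplitter is just for illustration):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceSplitter {

    // Split text into sentences using the JDK's rule-based sentence
    // BreakIterator; no statistical model or training data required.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<String>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            out.add(text.substring(start, end).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you? Fine!";
        for (String s : sentences(text, Locale.US)) {
            System.out.println(s);
        }
    }
}
```

The resulting sentence spans could then be handed off to the model-based POS and entity annotators, which is the hybrid pipeline Jonathan describes. Note that plain UAX #29-style rules will still mis-split around some abbreviations (e.g. "Dr. Smith"), which is where the "more sophisticated linguistic analysis" caveat comes in.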
