Some other things to think about:

1) What languages? The language(s) you're interested in may need anywhere from simple to fairly complex rules.

2) What is the "quality" of the input? For instance, if the input is email, users can be quite sloppy about following good punctuation and grammar rules. For things like web pages, formatting often gives clues for breaking text into sentences where punctuation is missing: titles (in big font) usually don't end with periods, and list items frequently don't either, yet both are often best treated as sentences (though not always).
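If the rule-based route Jonathan asks about seems like a good fit, a minimal sketch of such a splitter might look like the following. The class name and the specific rules here are mine, just to illustrate the idea - nothing below is OpenNLP API, and a real version would need abbreviation handling and the language/input considerations above:

import java.util.ArrayList;
import java.util.List;

// Rough sketch of a rule-based sentence detector (hypothetical class,
// not part of OpenNLP). Rules: break after . / ! / ? followed by
// whitespace, and at line breaks, since lines without terminal
// punctuation (titles, list items) are often sentences themselves.
public class RuleBasedSentenceDetector {

    public List<String> detect(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean boundary =
                // punctuation-based break; note this naive rule would
                // also split after abbreviations like "Dr."
                ((c == '.' || c == '!' || c == '?')
                        && i + 1 < text.length()
                        && Character.isWhitespace(text.charAt(i + 1)))
                // line-break rule for titles and list items
                || c == '\n';
            if (boundary) {
                String s = text.substring(start, i + 1).trim();
                if (!s.isEmpty()) {
                    sentences.add(s);
                }
                start = i + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = "A Big-Font Title\n"
                + "First sentence. Second one! A question?\n"
                + "- a list item\n- another one";
        new RuleBasedSentenceDetector().detect(sample)
                .forEach(System.out::println);
    }
}

Something like this could be wrapped in an annotator that feeds the opennlp pos and entity annotators, as Jonathan suggests; the hard part is usually deciding which rules actually hold for your language and input.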
As you can see - it may be hard to come up with general advice here...

-Marshall

jonathan doklovic wrote:
> Hi,
>
> I've been playing around with the opennlp wrappers and will probably
> make use of the entity detection, but I was wondering about the sentence
> and token detection.
>
> It seems that a model (statistical) based approach may be overkill and
> more of a pain to correct errors in.
>
> I was wondering if there's any reason not to use a rule-based
> sentence/token detector that then feeds the opennlp pos and entity model
> based annotators?
>
> Any thoughts are welcome.
>
> - Jonathan
