Some other things to think about:

1) What languages? The language(s) you're interested in may need anywhere from simple to fairly complex rules.

2) What is the "quality" of the input? For instance, if the input is email, users can be quite sloppy about following good punctuation and grammar rules. For things like web pages, formatting often gives clues for breaking text into sentences where punctuation is missing: titles (in big font) usually don't end with periods, and list items frequently don't either, yet both are often best treated as sentences (though not always).
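If the rule-based route Jonathan asks about seems like a good fit, a minimal sketch of such a splitter might look like the following. The class name and the specific rules here are mine, just to illustrate the idea - nothing below is OpenNLP API, and a real version would need abbreviation handling and the language/input considerations above:

import java.util.ArrayList;
import java.util.List;

// Rough sketch of a rule-based sentence detector (hypothetical class,
// not part of OpenNLP). Rules: break after . / ! / ? followed by
// whitespace, and at line breaks, since lines without terminal
// punctuation (titles, list items) are often sentences themselves.
public class RuleBasedSentenceDetector {

    public List<String> detect(String text) {
        List<String> sentences = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean boundary =
                // punctuation-based break; note this naive rule would
                // also split after abbreviations like "Dr."
                ((c == '.' || c == '!' || c == '?')
                        && i + 1 < text.length()
                        && Character.isWhitespace(text.charAt(i + 1)))
                // line-break rule for titles and list items
                || c == '\n';
            if (boundary) {
                String s = text.substring(start, i + 1).trim();
                if (!s.isEmpty()) {
                    sentences.add(s);
                }
                start = i + 1;
            }
        }
        String tail = text.substring(start).trim();
        if (!tail.isEmpty()) {
            sentences.add(tail);
        }
        return sentences;
    }

    public static void main(String[] args) {
        String sample = "A Big-Font Title\n"
                + "First sentence. Second one! A question?\n"
                + "- a list item\n- another one";
        new RuleBasedSentenceDetector().detect(sample)
                .forEach(System.out::println);
    }
}

Something like this could be wrapped in an annotator that feeds the opennlp pos and entity annotators, as Jonathan suggests; the hard part is usually deciding which rules actually hold for your language and input.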
As you can see - it may be hard to come up with general advice here...

-Marshall

jonathan doklovic wrote:
> Hi,
>
> I've been playing around with the opennlp wrappers and will probably
> make use of the entity detection, but I was wondering about the sentence
> and token detection.
>
> It seems that a model (statistical) based approach may be overkill and
> more of a pain to correct errors in.
>
> I was wondering if there's any reason not to use a rule-based
> sentence/token detector that then feeds the opennlp pos and entity model
> based annotators?
>
> Any thoughts are welcome.
>
> - Jonathan
