From: "Philippe Verdy" <[EMAIL PROTECTED]>

Unicode already defines, through character properties, the punctuation marks that terminate sentences. Why would you need to recognize sequences of two spaces as meaning an end of sentence?

Ambiguity remains. My colleague David Palmer did some testing of various algorithms:


http://citeseer.nj.nec.com/palmer97adaptive.html

The simplest heuristic approach, slightly more sophisticated than the Emacs regular expression someone mentioned, misclassified periods about 8% of the time on an annotated Wall Street Journal corpus. David's SATZ program, which uses a neural net or a decision tree trained on a similar corpus, had an error rate just above 1%. A Flex-based English tokenizer I had built previously got down to 0.9%, using a list of 75 common abbreviations and about 100 rules (not all of which had to do with sentence-boundary disambiguation). Some later work that David and I did combined the latter two approaches. If I remember correctly, the combined system had a 0.5% error rate on the same evaluation corpus.
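
To give a feel for what a simple heuristic of this kind looks like, here is a rough Python sketch. It is not the Emacs expression, SATZ, or my Flex tokenizer; the regular expression, the function name split_sentences, and the short abbreviation list are all assumptions made up for illustration (the real list had 75 entries, which are not reproduced here). A period, question mark, or exclamation point followed by whitespace and a capital letter is treated as a candidate boundary, and a period is rejected when the word before it is a known abbreviation.

```python
import re

# Illustrative abbreviation list (an assumption, not the 75-entry list
# mentioned in this message). Entries are lowercase, without the final period.
ABBREVIATIONS = {"mr", "mrs", "dr", "inc", "corp", "co", "u.s", "e.g", "i.e", "vs"}

# Candidate boundary: '.', '!' or '?' followed by whitespace and a capital letter.
CANDIDATE = re.compile(r"[.!?]\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into sentences with a simple punctuation-plus-abbreviation heuristic."""
    sentences = []
    start = 0
    for match in CANDIDATE.finditer(text):
        # Find the word immediately before the punctuation mark.
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        # A period after a known abbreviation is not treated as a boundary.
        if text[match.start()] == "." and last_word in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    # Whatever remains after the last accepted boundary is the final sentence.
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences("Mr. Smith joined Acme Corp. in 1990. He left in 1995."))
```

On the sample sentence this keeps "Mr." and "Corp." inside the first sentence. It also shows where the ambiguity comes from: given "He lives in the U.S. She moved.", the sketch refuses to split after "U.S.", because an abbreviation list alone cannot tell a sentence-final abbreviation from a sentence-internal one.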

SATZ's results on French and German data were better, with error rates hovering around 0.5%, because there was less period ambiguity in those corpora.

Like many natural language phenomena, this problem is harder than it looks at first glance.

- John Burger
  MITRE




