My immediate reaction to this TR was that it was doomed, given how 
difficult it is to tokenize text perfectly (I have written a number of 
tokenizers for natural language processing, and they are never 
complete).  However, after reading the draft, I found myself agreeing 
that it is reasonable to provide =some= guidance for the 80% solution.  
So, I looked at the code for some of my tokenizers.  Most of the special 
cases covered there are not appropriate for the TR, but I do have the 
following suggestion:

Consider adding U+0026 (ampersand) to the MidLetter class.  I did a 
quick scan through a few million words of New York Times data I have, 
and found that most mid-word occurrences should probably not induce 
word breaks, e.g.,

   Q&A
   R&R
   AT&T
   P&G
   ...

Exceptions included:

   Ben&Jerry
   How&Why

Perhaps a more conservative rule would involve only uppercase letters ...
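
To make the two alternatives concrete, here is a rough sketch in Python.  
It is not the TR's algorithm, only a regex approximation of "no break 
around an ampersand between letters".  The names tokenize, MIDLETTER_AMP 
and UPPER_ONLY_AMP are made up for illustration, and the conservative 
variant assumes ASCII uppercase only.

```python
import re

# A "letter" here is any Unicode word character that is not a digit or underscore.
LETTERS = r"[^\W\d_]+"

# Rule A (assumption: "&" behaves like MidLetter): no break when "&"
# sits directly between two letters.
MIDLETTER_AMP = re.compile(LETTERS + r"(?:&" + LETTERS + r")*|\S")

# Rule B (conservative sketch): join across "&" only when the letters on
# both sides of it are ASCII uppercase.
UPPER_ONLY_AMP = re.compile(
    LETTERS + r"(?:(?<=[A-Z])&(?=[A-Z])" + LETTERS + r")*|\S"
)


def tokenize(text, pattern):
    """Return the non-whitespace tokens of text under the given rule."""
    return pattern.findall(text)


if __name__ == "__main__":
    for sample in ["Q&A", "AT&T", "Ben&Jerry", "How&Why"]:
        print(sample,
              tokenize(sample, MIDLETTER_AMP),
              tokenize(sample, UPPER_ONLY_AMP))
```

Under the first rule all four samples stay whole; under the uppercase-only 
rule Q&A and AT&T stay whole, while Ben&Jerry and How&Why are split at 
the ampersand.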

A caveat: I am unfamiliar with analogous cases in languages other than 
English.

- John Burger
   MITRE


