My immediate reaction to this TR was that it was doomed, given how difficult it is to tokenize text perfectly. (I have written a number of tokenizers for natural language processing, and they are never complete.) However, after reading the draft, I found myself agreeing that it is reasonable to provide =some= guidance for the 80% solution. So I looked at the code for some of my tokenizers. Most of the special cases covered there are not appropriate for the TR, but I do have the following suggestion:
Consider adding U+0026 (ampersand) to the MidLetter class. I did a quick scan through a few million words of New York Times data I have, and found that most mid-word occurrences would probably not induce word breaks, e.g.,

    Q&A  R&R  AT&T  P&G  ...

Exceptions included:

    Ben&Jerry  How&Why

Perhaps a more conservative rule would involve only uppercase letters ...

A caveat: I am unfamiliar with analogous cases in languages other than English.

- John Burger
  MITRE
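To make the suggestion concrete, here is a minimal sketch of how the two variants might behave. It is not the TR's algorithm: it models only a "do not break between letters across a MidLetter" rule, uses Python's `str.isalpha()` as a rough stand-in for the letter class, and takes a toy MidLetter set as a parameter. The function name `tokenize` and the `uppercase_only` parameter are hypothetical, introduced purely for illustration.

```python
def tokenize(text, mid_letters, uppercase_only=frozenset()):
    """Split text into tokens, keeping letter + MidLetter + letter together.

    mid_letters: characters treated as MidLetter (toy set, an assumption).
    uppercase_only: subset of mid_letters that join only when both
    neighbouring letters are uppercase (the more conservative variant).
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1                      # whitespace separates tokens
            continue
        if not ch.isalpha():
            tokens.append(ch)           # any other character stands alone
            i += 1
            continue
        j = i + 1                       # text[j - 1] is always a letter here
        while j < n:
            c = text[j]
            if c.isalpha():
                j += 1
            elif (c in mid_letters
                  and j + 1 < n and text[j + 1].isalpha()
                  and (c not in uppercase_only
                       or (text[j - 1].isupper() and text[j + 1].isupper()))):
                j += 2                  # consume the MidLetter and next letter
            else:
                break
        tokens.append(text[i:j])
        i = j
    return tokens


# Without the ampersand in the set, AT&T is split into three tokens.
print(tokenize("AT&T", {"'"}))
# With the ampersand added, it stays together, but so does Ben&Jerry.
print(tokenize("AT&T Ben&Jerry", {"'", "&"}))
# The uppercase-only variant keeps AT&T and splits Ben&Jerry.
print(tokenize("AT&T Ben&Jerry", {"'", "&"}, uppercase_only={"&"}))
```

The uppercase-only check is kept as a separate parameter so that the plain MidLetter behaviour for other characters (the apostrophe in this toy set) is left unchanged.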

