Peter Kirk wrote:

> I think this may be a "Peter mistake". I meant to refer to spacing
> diacritics. Sorry.
>
> It is certainly highly inappropriate for spacing diacritics to
> be considered word boundaries.

Why? It is entirely dependent on the orthography and conventions involved. There is probably as much (or more) bad ASCII usage of spacing diacritics like `this', where a grave accent character is being misapplied to make a directional quotation mark, as there is actual, linguistically appropriate use of spacing diacritics.

Also, everyone should consider carefully the status of UAX #29, Text Boundaries.

<quote>
2 Conformance

This is informative material. There are many different ways to divide text elements corresponding to grapheme clusters, words and sentences, and the Unicode Standard and this document do not restrict the ways in which implementations can do this.

This specification is a <emphasis>default</emphasis> mechanism; more sophisticated engines can and should tailor it for particular locales or environments.
...
</quote>

The whole UAX is informative. It is a here's-how-you-can-approach-the-problem implementation guide with some suggestions for rules and classes.

*If* you are working with an orthography that uses one or more spacing diacritics, and *If* those spacing diacritics need to be represented by <SPACE, NSM> sequences, then you are in the situation where your implementation of text boundaries should take <SPACE, NSM> sequences explicitly into account, so as to result in expected behavior for that orthography.

Everyone has had experiences with their platform UI producing bad results for text boundaries. The Solaris platform I am writing this on right now, for example, implements a double-click word selection that treats the string "`this'," above, including the grave accent, the apostrophe, and the comma, as a "word". Is that right or wrong? Well, it depends on what you are trying to do, I expect.

But even the most sophisticated platform implementers can only do so much with processes like default word selection. It is bound to be wrong for one purpose or another and for one orthography or another.
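
To make the <SPACE, NSM> case above concrete, here is a minimal sketch of one possible tailoring, written in Python. It is not the UAX #29 algorithm and not any platform's actual behavior: it is a plain splitter on U+0020, and the function names `is_nsm` and `split_words`, the `keep_space_nsm` flag, and the choice to attach the spacing diacritic to the preceding text are all assumptions made for illustration. A real orthography might need the opposite attachment, or different rules altogether.

```python
import unicodedata


def is_nsm(ch: str) -> bool:
    """Return True for nonspacing marks (General_Category Mn)."""
    return unicodedata.category(ch) == "Mn"


def split_words(text: str, keep_space_nsm: bool = True) -> list[str]:
    """Split text on U+0020 SPACE.

    With keep_space_nsm=True (the hypothetical tailoring), a SPACE that is
    immediately followed by a nonspacing mark is treated as the base of a
    spacing diacritic and kept inside the current word instead of ending it.
    """
    words: list[str] = []
    current: list[str] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch == " ":
            nxt = text[i + 1] if i + 1 < n else ""
            if keep_space_nsm and nxt and is_nsm(nxt):
                # <SPACE, NSM>: keep both characters in the current word.
                current.extend((ch, nxt))
                i += 2
                continue
            # Ordinary space: close the current word, if there is one.
            if current:
                words.append("".join(current))
                current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    if current:
        words.append("".join(current))
    return words


# U+0301 COMBINING ACUTE ACCENT after a space stands in for a spacing diacritic.
sample = "ab \u0301cd ef"
tailored = split_words(sample)
untailored = split_words(sample, keep_space_nsm=False)
```

With the tailoring switched on, `sample` yields two words and the <SPACE, NSM> sequence stays inside the first; with it off, the same string splits into three pieces and the mark is stranded at the start of the second. Neither result is "right" in general, which is the point.
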
Ultimately you need to have tailored processes that can be orthography-specific if you want to get best results.

--Ken

