Peter Kirk wrote:
I think this may be a "Peter mistake". I meant to refer to spacing diacritics. Sorry.
It is certainly highly inappropriate for spacing diacritics to be considered word boundaries.
Why? It is entirely dependent on the orthography and conventions involved. ...
Well, agreed, there may be orthographic conventions in which a spacing diacritic is considered a word boundary or a break opportunity, e.g. if used like a hyphen. But there are other mechanisms for forcing a word boundary where otherwise there would not be one. Are there any to suppress a word boundary? Perhaps I need to encode <WJ, space, diacritic, WJ> to avoid the word boundary implication? Would this work?
... There is probably as much (or more) bad ASCII usage of spacing diacritics like `this', where a grave accent character is being misapplied to make a directional quotation mark, as there is actual, linguistically appropriate use of spacing diacritics.
But this is an abuse of the spacing diacritic as punctuation. Proper, linguistically appropriate use of spacing diacritics should not be broken in order to support abuse. Or, if the standard wants to support such abuse, we can reserve <space, diacritic> for the abuse and define a new character XXX such that <XXX, diacritic> has the properties for the linguistically appropriate use.
Also, everyone should consider carefully the status of UAX #29, Text Boundaries.
<quote> 2 Conformance
This is informative material. There are many different ways to divide text elements corresponding to grapheme clusters, words and sentences, and the Unicode Standard and this document do not restrict the ways in which implementations can do this.
This specification is a <emphasis>default</emphasis> mechanism; more sophisticated engines can and should tailor it for particular locales or environments. ... </quote>
The whole UAX is informative. ...
Then let it be correctly informative and not full of misinformation. And let its default mechanism and recommendations be appropriate for the majority of uses, including such cases as a list of diacritics which may occur in any orthography.
Ken, it seems to me all the more clearly from looking at the latest batch of postings on this list that the <space, diacritic> mechanism defined by Unicode is fundamentally flawed. It works, but it creates a serious and needless complication for all kinds of other processes, including rendering and higher level processes. These processes cannot simply take a space as a space and process it as such. Every time they come across a space (which is very often!) they have to test whether it is followed by a combining character, and if it is, they have to treat that space specially. This has created a serious problem for implementers, which is why they have produced non-conforming implementations - and we are not talking about small companies which have rushed into the market recently, we are talking about Microsoft, among others, which has been sponsoring Unicode from the start, I understand. Surely the UTC should not create difficulties for implementers and then just shout at them for getting things wrong. The UTC should try to produce a standard which is workable without unnecessary complications.
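To make the extra test concrete, here is a minimal Python sketch of the lookahead that every space-handling process would need. It is only an illustration under stated assumptions: the function names are hypothetical, "combining character" is approximated as any character in general category Mn, Mc or Me, and no claim is made that any real rendering engine works this way.

```python
import unicodedata

SPACE = "\u0020"
NBSP = "\u00A0"


def is_combining_mark(ch: str) -> bool:
    """Return True for characters in general category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


def classify_spaces(text: str) -> list[tuple[int, str]]:
    """Label each SPACE or NBSP as 'plain' or 'diacritic-base'.

    A space counts as 'diacritic-base' when the very next character is a
    combining mark, i.e. the space only carries a spacing diacritic.
    (Hypothetical helper, written for this illustration.)
    """
    labels = []
    for i, ch in enumerate(text):
        if ch in (SPACE, NBSP):
            nxt = text[i + 1] if i + 1 < len(text) else ""
            is_base = bool(nxt) and is_combining_mark(nxt)
            labels.append((i, "diacritic-base" if is_base else "plain"))
    return labels


# a, SPACE, COMBINING ACUTE ACCENT, SPACE, b
sample = "a \u0301 b"
print(classify_spaces(sample))  # [(1, 'diacritic-base'), (3, 'plain')]
```

The point of the sketch is that the check cannot be skipped: a process that treats index 1 as an ordinary space would wrongly split or trim the <space, diacritic> sequence.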
I agree that it works better to use NBSP here. There are fewer such problems, but they have not gone away entirely. And NBSP is more likely to be treated by implementers (in the absence of other guidelines from Unicode) as fixed width, not trimmed to the width needed for the diacritic.
-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/

