2011/7/1 Richard Wordingham <[email protected]>:
> I wonder if anyone has some statistics on the use of CGJ. Its revised
> intended use was to disrupt collating sequences, but you may be right
> about its most frequent use being to disrupt canonical reordering. A
> few years ago I concluded it wasn't yet safe to type the Welsh place
> name Llan͏gollen with CGJ.

Interestingly, I can't get this name to render correctly in my version of Chrome on Windows 7: it displays the occurrence of CGJ as a non-spacing dotted box that overprints the surrounding characters "n" and "g", so that the place name is completely unreadable. I just wonder why Chrome needs to display this control in such a disruptive way (I have not checked other browsers).

Why do you need CGJ between "n" and "g"?
- Is it to make sure that they won't collate as the single element "ng", but separately? How is this different from the collation of "language", where the situation would be similar?
- Or do you intend the reverse, i.e. to actually collate "ng" in "Llangollen" as a single element?

Sorry, I don't know Welsh; all I know is that "ng" is a digraph of its alphabet, which also includes "n" and "g" as separate letters. Other digraphs are "dd" contrasting with isolated "d", "ff" contrasting with isolated "f", "ll" contrasting with isolated "l", "ph" contrasting with isolated "p" and "h", "rh" contrasting with isolated "r" and "h", and finally "th" contrasting with isolated "t" and "h".

Those Welsh digraphs are not exceptional: you'll find them in many other Latin-based orthographies, except that there they are not considered single letters of the alphabet. Welsh is closely related to Breton, but Breton lists far fewer digraphs/trigraphs (such as "ch" and "c’h"). French and English, for example, use a lot of digraphs as well, but because of their huge number of lexical imports with varied etymologies, these languages have not attempted to fix a rule for digraphs in their alphabets, and so they just list the letters separately. Digraph analysis requires contextual analysis of phonology and morphology, including dictionary lookups, to get the hyphenation right. Such contextual lexical lookup is probably needed in Welsh as well, since it certainly borrows lots of English words today.

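To make the collation question concrete, here is a minimal Python sketch of how a Welsh-style sort key might treat "ng" with and without CGJ (U+034F). It is only a toy: the names `WELSH_ALPHABET`, `welsh_letters` and `welsh_key` are hypothetical, the alphabet order is the conventional 29-letter one as I understand it, and a real implementation would use a tailored Unicode Collation Algorithm rather than this greedy tokenizer.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Conventional Welsh alphabet order (assumption for this sketch);
# digraphs are letters of their own, and "ng" sorts right after "g".
WELSH_ALPHABET = [
    "a", "b", "c", "ch", "d", "dd", "e", "f", "ff", "g", "ng", "h",
    "i", "j", "l", "ll", "m", "n", "o", "p", "ph", "r", "rh", "s",
    "t", "th", "u", "w", "y",
]
RANK = {letter: rank for rank, letter in enumerate(WELSH_ALPHABET)}


def welsh_letters(word):
    """Split a word into Welsh letters, greedily pairing digraphs.

    A CGJ between two characters keeps them from being paired;
    the CGJ itself contributes no letter.
    """
    text = word.lower()
    letters = []
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            i += 1  # the joiner only blocks pairing, then is dropped
            continue
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            letters.append(pair)
            i += 2
        else:
            letters.append(text[i])
            i += 1
    return letters


def welsh_key(word):
    """Sort key: rank of each letter; unknown characters sort last."""
    return [RANK.get(letter, len(RANK)) for letter in welsh_letters(word)]


plain = "Llangollen"
blocked = "Llan" + CGJ + "gollen"
print(welsh_letters(plain))    # "ng" taken as one letter
print(welsh_letters(blocked))  # "n" and "g" kept apart
```

With this key the two spellings sort differently, because the single letter "ng" ranks just after "g" while a separate "n" ranks much later. Which of the two readings is right for a given word is exactly the dictionary question raised above; the sketch cannot decide that by itself.
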
If your intent is to signal to a word hyphenator some "don't break here" condition (in the middle of an exceptional digraph), or "break allowed here" (in the middle of what the language's alphabet generally considers an unbreakable digraph), there are probably better controls (other kinds of joiners/disjoiners) than CGJ to specify that. [There exist some C1 controls inherited from ISO 8859-1 and EBCDIC, except that these C1 controls have very poor support and various incompatible system-specific usages, or would not be allowed in transport layers, or could be considered invalid by some document parsers. Another, well-supported control is SOFT HYPHEN, which explicitly encodes "break allowed here", and which you could insert just before the "ng" digraph in "Llangollen" if it really is a digraph in this context.]
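As a small illustration of the SOFT HYPHEN suggestion, the Python sketch below inserts U+00AD just before the "ng" in "Llangollen" and strips it again. The helper names `with_break_hint` and `strip_break_hints` are hypothetical, and the chosen position simply follows the "if it is a digraph here" assumption; it is not a claim about correct Welsh hyphenation.

```python
import unicodedata

SHY = "\u00ad"  # SOFT HYPHEN: an invisible "break allowed here" hint


def with_break_hint(word, index):
    """Return word with a soft hyphen inserted before position index."""
    return word[:index] + SHY + word[index:]


def strip_break_hints(text):
    """Remove soft hyphens, e.g. before comparing or searching text."""
    return text.replace(SHY, "")


# Assumption: "ng" is treated as a digraph here, so the hint goes before it.
hinted = with_break_hint("Llangollen", 3)

print(unicodedata.name(SHY), unicodedata.category(SHY))
print([f"U+{ord(ch):04X}" for ch in hinted])
print(strip_break_hints(hinted) == "Llangollen")
```

The hint changes the stored string, so anything that compares or searches the text would need to ignore it, which is what `strip_break_hints` stands in for here.
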

