2011/7/17 Asmus Freytag <[email protected]>:
> On 7/17/2011 2:35 AM, Michael Everson wrote:
>>
>> ... invisible and stateful control characters are more expensive than
>> ordinary graphic symbols.
>
> In this case, the expense is so much higher as to rule out such an idea from
> the start.
>
> A./
>
> PS: this doesn't mean that adding graphic symbols is the foregone thing to
> do, only that, if evidence points to the need to address this issue in
> character encoding, then, using graphic symbols is the better way to go
> about it.

Another alternative: instead of encoding a separate symbol for each control, we could just as well encode symbols for each character visible in those symbols. E.g. to represent the glyph for the RLO control, we could encode three characters, one for each of R, L, and O, as DOTTED SYMBOL FOR LATIN CAPITAL LETTER R, DOTTED SYMBOL FOR LATIN CAPITAL LETTER L, DOTTED SYMBOL FOR LATIN CAPITAL LETTER O.

The representative glyph of each of these three symbols would be the base letter from which it is derived, within a dotted rectangle. Each of them would then contextually adopt one of four glyph forms: the full rectangle, the rectangle with the left side removed, the rectangle with the right side removed, or the rectangle with both sides removed. The form would be selected contextually, depending on the neighbouring characters.

If this is still too complex, because fonts would have to look up lots of pairs, we could instead use the normal Latin letters or symbols, each one modified by an enclosing diacritic encoded after it (with combining class zero, so that it will not be reordered during normalization, and with general category "Me" for enclosing). In this case we just need to encode four diacritics:

U+xxx0: ENCLOSING DOTTED SQUARE JOINED ON BOTH SIDES (short alias "EDSB" below)
U+xxx1: ENCLOSING DOTTED SQUARE JOINED ON START SIDE ONLY (alias "EDSS")
U+xxx2: ENCLOSING DOTTED SQUARE JOINED ON END SIDE ONLY (alias "EDSE")
U+xxx3: ENCLOSING DOTTED SQUARE DISJOINED (alias "EDSD")

Then to represent the symbol for RLO in a dotted square, we would use <R, EDSS, L, EDSB, O, EDSE>.

The only problem with this representation using normal characters is that fonts (or text renderers) may have to reduce the size of the glyphs for the characters within these enclosing boxes for best display (but this should not be a requirement; there's no fundamental difference, the only change being the overall width/height of the fully composed "symbol"). No complexity, no control used.
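
To make the enclosing-diacritic scheme above a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the four diacritics are not encoded, so the Private Use Area values U+E000..U+E003 and the helper name enclose() are hypothetical placeholders, and the sketch assumes each enclosed item is a single code point. The assignment of EDSS, EDSB and EDSE to the first, middle and last letters simply follows the <R, EDSS, L, EDSB, O, EDSE> example. The last lines check, with the standard unicodedata module, that the existing U+20DE COMBINING ENCLOSING SQUARE already has the properties suggested for the new diacritics.

```python
import unicodedata

# Hypothetical placeholders: the message leaves the code points as
# U+xxx0..U+xxx3. Private Use Area values are used only so the sketch runs.
EDSB = "\uE000"  # ENCLOSING DOTTED SQUARE JOINED ON BOTH SIDES
EDSS = "\uE001"  # ENCLOSING DOTTED SQUARE JOINED ON START SIDE ONLY
EDSE = "\uE002"  # ENCLOSING DOTTED SQUARE JOINED ON END SIDE ONLY
EDSD = "\uE003"  # ENCLOSING DOTTED SQUARE DISJOINED


def enclose(text: str) -> str:
    """Append one enclosing diacritic after each code point of `text`.

    Follows the pattern of the RLO example: first letter gets EDSS,
    middle letters get EDSB, last letter gets EDSE; a lone letter
    gets EDSD. Assumes one code point per enclosed item.
    """
    if not text:
        return ""
    if len(text) == 1:
        return text + EDSD
    last = len(text) - 1
    pieces = []
    for i, ch in enumerate(text):
        if i == 0:
            mark = EDSS
        elif i == last:
            mark = EDSE
        else:
            mark = EDSB
        pieces.append(ch + mark)
    return "".join(pieces)


rlo_symbol = enclose("RLO")  # <R, EDSS, L, EDSB, O, EDSE>

# The existing enclosing square has the properties proposed above:
# general category "Me" and canonical combining class 0.
existing = "\u20DE"  # COMBINING ENCLOSING SQUARE
category = unicodedata.category(existing)
combining_class = unicodedata.combining(existing)

# With combining class 0, canonical normalization leaves the order intact.
sample = "R" + existing + "L" + existing
unchanged_by_nfd = unicodedata.normalize("NFD", sample) == sample
```

Real rendering would of course depend on font support for the new diacritics; the sketch only shows what the encoded sequence would look like.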
Only the visible symbol is represented, not the control that this string stands for (there's not even a requirement that such a string represent an actual Unicode character; it could be used for various symbols, or in texts that need to encode such enclosures). It can enclose any kind of character of any script, including diacritics, digits, or non-breaking spaces. By extension, we could just as well add similar diacritics for enclosing dotted circles/ovals, or for enclosing squares/rectangles, of arbitrary lengths. Note that we already have combining characters for enclosing boxes and circles, so this is not really a new concept in Unicode.

It's true that such a representation using explicitly encoded diacritics is an alternative to text decorations used in rich text formats. The encoding would be expensive enough to discourage its use for enclosing arbitrarily long texts, which would certainly benefit more from an external text decoration of a "span" of text in a higher-level protocol (such as CSS, using "border:" properties).

One caveat is that such a sequence would not be collated as a single grapheme cluster (is that a problem? This is already the case when a text cites the abbreviation "RLO" using plain Latin letters, possibly surrounded by regular punctuation, symbols or spaces), and it could collide with words appearing directly on either side (only a problem for word breakers; if no SPACE character separates the "symbol" from the surrounding text, we can still use a ZERO WIDTH SPACE to separate them).

I also see no harm if those sequences are recognized not as "symbols" but as words. It would even benefit spell checkers, which would easily detect that the enclosed letters are in fact treated like abbreviations, where each grapheme is decorated by these diacritics.

Note: in a previous message I already mentioned another alternative, using start and end punctuation marks (i.e. general categories "Ps" and "Pe") that would be normal base characters (similar to parentheses, brackets and braces), but the difficulty is to have them connect graphically on top of sequences of separate grapheme clusters.

-- Philippe.

