Martin Kochanski <unicode at cardbox dot net> wrote:

> To look at it another way, virtually the only action that the Unicode
> Consortium needs to take to define UNRENDERED CHARACTER is to promise
> never to define a character at that code point.

I think this is exactly what they have done by creating the "noncharacters" from U+FDD0 through U+FDEF. These code points are guaranteed never to be assigned to real characters.

The better-known noncharacters U+FFFE and U+FFFF have some "suggested" semantics that may discourage their use in applications such as Martin's. U+FFFE is a byte-swapped UTF-16 BOM, so certain software might handle it specially with that in mind (e.g. it might byte-swap the rest of the text). U+FFFF is -1 when read as a signed 16-bit integer, so some software might intentionally use it as a sentinel or other special value.

OTOH, there is nothing numerically special about the code points U+FDD0 through U+FDEF, and it seems unlikely that much software knows about them or handles them in any special way, so Martin can probably use them without interference from the OS or app.

> UNRENDERED CHARACTER has to be part of the BMP for backward
> compatibility: it should be renderable as a single glyph, not as a
> pair of glyphs, even on old systems that do not understand surrogates.
> The proposed positioning is intended to persuade older systems that
> this character should be rendered conventionally, like a Latin letter.

The suggested noncharacter code points are indeed in the BMP (there are others outside the BMP).

Putting such a beast in or near an alphabetic script block, however, implicitly assigns a meaning to it (e.g. "this is an unrendered character for use with alphabetic scripts"), which is exactly what Martin was trying to avoid. Special formatting characters and characters intended to aid special-purpose display scenarios (like the control pictures) are intentionally segregated far from the alphabetic script blocks.

> Otto Stolz suggested U+03A2, which would be equally valid. However,
> U+03A2 is quite obviously the code for GREEK CAPITAL LETTER FINAL
> SIGMA. For O.S., this is a reason for using the code (because there
> is, in fact, no such letter, so the code can be used); for me, this
> is a strong reason for *not* using the code, because if it **ever**
> became necessary to encode GREEK CAPITAL LETTER FINAL SIGMA then no
> character other than U+03A2 would be acceptable, whereas U+024F has no
> inherent semantics at all.

The only reason to ever encode GREEK CAPITAL LETTER FINAL SIGMA, or for that matter, LATIN CAPITAL LETTER SHARP S, would be to make some sort of typographical point or engage in some kind of spelling reform. As we know, spelling reforms are usually aimed at making things simpler, not more complex, and "odd" letters like Greek final sigma and Latin sharp-s are part of that complexity. So it really shouldn't ever be necessary to assign such a thing, but of course Asmus is correct; no code point is 100% safe.

My recommendation: Use the noncharacters. That's what they're there for.

-Doug Ewell
 Fullerton, California
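
The points about noncharacters can be checked with a minimal Python sketch. The names `UNRENDERED`, `is_noncharacter`, `bom_read_with_wrong_byte_order`, and `as_signed_16` are illustrative assumptions, and U+FDD0 is an arbitrary pick from the U+FDD0..U+FDEF range, not a value proposed in the thread.

```python
import struct

# Hypothetical choice: any code point in U+FDD0..U+FDEF would do equally well.
UNRENDERED = "\uFDD0"


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode noncharacters."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+FFFE, U+FFFF, U+1FFFE, ...
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE


def bom_read_with_wrong_byte_order() -> int:
    """Encode the BOM (U+FEFF) little-endian, then read the bytes big-endian."""
    return int.from_bytes("\uFEFF".encode("utf-16-le"), "big")


def as_signed_16(cp: int) -> int:
    """Reinterpret a 16-bit code unit as a signed 16-bit integer."""
    return struct.unpack("<h", struct.pack("<H", cp))[0]
```

Here `is_noncharacter(ord(UNRENDERED))` is true, `bom_read_with_wrong_byte_order()` yields 0xFFFE, and `as_signed_16(0xFFFF)` yields -1, which is why those two code points may already carry special handling, while the U+FDD0..U+FDEF range has no such numeric coincidence.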