The responses from this mailing list have made me re-think the problem and propose a
possible solution.
The point about missing characters (more accurately, "unrendered characters") is that
different fonts (more accurately, different combinations of font plus rendering
system) display them in different ways. I have seen hollow squares and rectangles;
filled rectangles; small diamond-shaped bullets; and question marks.
Unrendered characters will become more noticeable as Unicode becomes more widespread
and computing increasingly transcends linguistic and script boundaries. On the whole,
with existing 7-bit and 8-bit national standards, a user in any particular country
will find that any character that can be encoded can also be displayed, so that the
distinction between encodable and displayable characters is one that simply does not
need to occur to an ordinary user. But someone using Unicode to view (for example) Web
pages from another country may find that the fonts on his computer are missing some
vital characters, which the computer then renders in an arbitrary way (as hollow
squares, etc.), leading to puzzlement and confusion. Eventually, as "large" Unicode
fonts become more widely installed, the problem will diminish; but it will never
entirely go away unless the Unicode standard stops evolving.
There is a need to talk about what an unrendered character looks like when explaining
the concept to a user and explaining that special actions may need to be taken (for
instance, changing fonts or downloading a new version of a font).
Printed manuals can handle unrendered characters quite easily. The manual can use one
arbitrarily chosen appearance (such as U+25AF or U+2337) for unrendered characters,
with a note (on first occurrence) that the screen appearance of unrendered characters
may vary - screenshots can be given as examples.
On-screen text, however, does present problems, especially on Web pages. The writer of
the text has no control over the font that will be used to display it [in some cases
he may be able to specify or request the *name* of the font to be used, but this is no
guarantee that the font of that name will contain all the needed characters or that it
will even be installed on the user's computer]. There is a need to be able to say in a
web page: "If some of the text on this page looks like this: ????? then you should
install font XXXX / download a new font from [link]" - where ????? looks *exactly* how
an unrendered character would look in the font that the web page is being displayed
with.
No presently defined Unicode character can be used to represent <?> in the above
message. A hollow rectangle such as U+25AF or U+2337 will only resemble the screen
appearance of unrendered characters if the font being used happens to use that
particular sort of hollow rectangle to represent unrendered characters: in a font that
uses small diamonds, representing <?> as a hollow square would be confusing and
counter-productive.
For the same reason, a bitmap cannot be used: a bitmap's appearance will not vary
automatically as the font used to display the message changes.
Rewriting the message to say "If a lot of the text on this page looks like hollow
squares or small solid rectangles or little diamonds or anything else strange, then
you should install font XXXX / download a new font from [link]" is not a practical
solution because it adds complexity, obscurity, and verbosity; adds a level of
abstraction that is neither necessary nor easy for the user to follow; and uses up
valuable screen space.
It follows that there is a need for a defined Unicode character that represents the
appearance of an unrendered character in the font in which it is displayed.
I am wondering whether it would be worth submitting a proposal for such a character.
For example:
U+024F UNRENDERED CHARACTER
While the addition of characters to Unicode is something to be done only as a last
resort, I believe that there is, in this case, no alternative.
Such a character proposal would have the advantage that every existing Unicode font
*already* implements it correctly - by definition [but see the note below about
section 5.3 of the Unicode standard]. Thus no changes will be needed to fonts or to
rendering engines.
To look at it another way, virtually the only action that the Unicode Consortium needs
to take to define UNRENDERED CHARACTER is to promise never to define a character at
that code point.
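A minimal sketch of how a page generator might rely on such a promise, assuming Python and its bundled unicodedata module. The function name missing_glyph_notice and its parameters are hypothetical. The code point is a parameter because whether a given code point is unassigned depends on the version of the Unicode database in use; the example call uses U+03A2 only because it is unassigned in the data shipped with Python.

```python
import unicodedata


def missing_glyph_notice(code_point: int, font_name: str, samples: int = 5) -> str:
    """Build the notice text, using an unassigned code point as the sample.

    Hypothetical helper: it refuses code points that the local Unicode
    database reports as assigned, since those would render as real glyphs.
    """
    sample_char = chr(code_point)
    # General category "Cn" means "other, not assigned" in the local database.
    if unicodedata.category(sample_char) != "Cn":
        raise ValueError(f"U+{code_point:04X} is assigned in this Unicode version")
    return (
        "If some of the text on this page looks like this: "
        f"{sample_char * samples} then you should install font {font_name}."
    )


# Example call; the sample characters display however the current font
# draws characters it cannot render.
print(missing_glyph_notice(0x03A2, "XXXX"))
```

The guard is the point of the sketch: without a promise from the standard, the same call could start failing (or, without the check, start showing a real letter) after a database update.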
UNRENDERED CHARACTER has to be part of the BMP for backward compatibility: it should
be renderable as a single glyph, not as a pair of glyphs, even on old systems that do
not understand surrogates. The proposed positioning is intended to persuade older
systems that this character should be rendered conventionally, like a Latin letter.
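To illustrate the surrogate point, here is a small sketch (assuming Python; the helper name utf16_code_units is made up for this example) that counts how many 16-bit code units UTF-16 needs for a code point. A BMP code point needs one; anything above U+FFFF needs a surrogate pair.

```python
def utf16_code_units(code_point: int) -> int:
    """Return how many 16-bit code units UTF-16 uses for one code point."""
    # "utf-16-be" encodes without prepending a byte order mark,
    # so the byte length is exactly 2 bytes per code unit.
    encoded = chr(code_point).encode("utf-16-be")
    return len(encoded) // 2


print(utf16_code_units(0x024F))   # 1: inside the BMP, a single code unit
print(utf16_code_units(0x10000))  # 2: outside the BMP, a surrogate pair
```

A system that does not understand surrogates would treat the second case as two separate unknown units, which is why a code point outside the BMP would not do.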
The nearest possible alternatives are:
U+FFFE - on at least some Windows systems, this is displayed correctly (i.e.
identically to characters that are missing from the current font); but in the Unicode
standard it has the explicit semantics of not being a character at all, and so ought
not to be intentionally used as a character (a rendering engine would be within its
rights to suppress it altogether; some application programs might report errors or
even become confused about byte ordering).
U+FFFD - on at least some Windows systems, this is displayed correctly (i.e.
identically to characters that are missing from the current font); but in the Unicode
standard it has the explicit semantics of being a replacement for a character
*unrepresentable in Unicode*. A character unrepresentable in Unicode is not the same
as a Unicode character that happens not to have a representation in the current font.
It is possible that a particular font may have distinctive visual representations of
U+FFFE and U+FFFD that are distinct from the way that it draws unrendered characters.
Otto Stolz suggested U+03A2, which would be equally valid. However, U+03A2 is quite
obviously the code for GREEK CAPITAL LETTER FINAL SIGMA. For O.S., this is a reason
for using the code (because there is, in fact, no such letter, so the code can be
used); for me, this is a strong reason for *not* using the code, because if it
**ever** became necessary to encode GREEK CAPITAL LETTER FINAL SIGMA then no character
other than U+03A2 would be acceptable, whereas U+024F has no inherent semantics at all.
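The differences between these candidates can be checked against a Unicode character database. The sketch below assumes Python's bundled unicodedata module (results depend on the Unicode version it ships with); the helper name describe is hypothetical. It also shows the byte-ordering hazard mentioned for U+FFFE: a little-endian byte order mark read as big-endian decodes to exactly that code point.

```python
import unicodedata


def describe(code_point: int) -> tuple[str, str]:
    """Return (general category, character name) from the local database."""
    ch = chr(code_point)
    # Unassigned code points and noncharacters have no name, so give a default.
    return unicodedata.category(ch), unicodedata.name(ch, "<no name>")


for cp in (0xFFFE, 0xFFFD, 0x03A2):
    category, name = describe(cp)
    print(f"U+{cp:04X}  {category}  {name}")

# Byte-order confusion: U+FEFF (the byte order mark) written little-endian
# is the byte pair FF FE, which a big-endian reader decodes as U+FFFE.
bom_little_endian = "\ufeff".encode("utf-16-le")
misread = bom_little_endian.decode("utf-16-be")
print(f"BOM read with the wrong byte order: U+{ord(misread):04X}")
```

In the database, U+FFFD is an assigned symbol with its own name, so a font is free to give it its own glyph, whereas U+FFFE and U+03A2 both fall in the "not assigned" category.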
Section 5.3 of the Unicode standard makes a distinction between unassigned and
unrenderable characters. Systems that make use of this distinction are an exception to
the statement I made earlier that "every existing Unicode font already renders
UNRENDERED CHARACTER correctly". Nevertheless, the rendering of UNRENDERED CHARACTER
as "unassigned" rather than "unrenderable" is unlikely to cause much confusion.
One other exception would be a pathologically helpful font/engine that represents each
unrendered character as a unique glyph (for example, a miniature of the character's
hexadecimal value). This, again, would not be a problem: the user will instantly
recognize "miniature 024F" as being different from ordinary characters and in the same
class as the "miniature 021D" glyphs that disfigure the page.
Would it be worth submitting a proposal for UNRENDERED CHARACTER? As I said, it *is*
adequately implemented already: the only purpose for wanting it defined in the
standard is to prevent the implementation from being suddenly broken in the future.
- Martin Kochanski.