Re: "Missing character" glyph

Martin Kochanski Thu, 01 Aug 2002 03:21:39 -0700

The responses from this mailing list have made me re-think the problem and propose a 
possible solution.


The point about missing characters (more accurately, "unrendered characters") is that 
different fonts (more accurately, different combinations of font plus rendering 
system) display them in different ways. I have seen hollow squares and rectangles; 
filled rectangles; small diamond-shaped bullets; and question marks.

Unrendered characters will become more noticeable as Unicode becomes more widespread 
and computing increasingly transcends linguistic and script boundaries. On the whole, 
with existing 7-bit and 8-bit national standards, a user in any particular country 
will find that any character that can be encoded can also be displayed, so that the 
distinction between encodable and displayable characters is one that simply does not 
need to occur to an ordinary user. But someone using Unicode to view (for example) Web 
pages from another country may find that the fonts on his computer are missing some 
vital characters, which the computer then renders in an arbitrary way (as hollow 
squares, etc.), leading to puzzlement and confusion. Eventually, as "large" Unicode 
fonts become more widely installed, the problem will diminish; but it will never 
entirely go away unless the Unicode standard stops evolving.

There is a need to talk about what an unrendered character looks like when explaining 
the concept to a user and explaining that special actions may need to be taken (for 
instance, changing fonts or downloading a new version of a font). 

Printed manuals can handle unrendered characters quite easily. The manual can use one 
arbitrarily chosen appearance (such as U+25AF or U+2337) for unrendered characters, 
with a note (on first occurrence) that the screen appearance of unrendered characters 
may vary - screenshots can be given as examples.

On-screen text, however, does present problems, especially on Web pages. The writer of 
the text has no control over the font that will be used to display it [in some cases 
he may be able to specify or request the *name* of the font to be used, but this is no 
guarantee that the font of that name will contain all the needed characters or that it 
will even be installed on the user's computer]. There is a need to be able to say in a 
web page: "If some of the text on this page looks like this: ????? then you should 
install font XXXX / download a new font from [link]" - where ????? looks *exactly* how 
an unrendered character would look in the font that the web page is being displayed 
with.

No presently defined Unicode character can be used to represent <?> in the above 
message. A hollow rectangle such as U+25AF or U+2337 will only resemble the screen 
appearance of unrendered characters if the font being used happens to use that 
particular sort of hollow rectangle to represent unrendered characters: in a font that 
uses small diamonds, representing <?> as a hollow square would be confusing and 
counter-productive.

For the same reason, a bitmap cannot be used: a bitmap's appearance will not vary 
automatically as the font used to display the message changes.

Rewriting the message to say "If a lot of the text on this page looks like hollow 
squares or small solid rectangles or little diamonds or anything else strange, then 
you should install font XXXX / download a new font from [link]" is not a practical 
solution because it adds complexity, obscurity, and verbosity; adds a level of 
abstraction that it is neither necessary nor easy for the user to follow; and uses up 
valuable screen space.

It follows that there is a need for a defined Unicode character that represents the 
appearance of an unrendered character in the font in which it is displayed.

I am wondering whether it would be worth submitting a proposal for such a character. 
For example: 
        U+024F UNRENDERED CHARACTER

While the addition of characters to Unicode is something to be done only as a last 
resort, I believe that there is, in this case, no alternative.

Such a character proposal would have the advantage that every existing Unicode font 
*already* implements it correctly - by definition [but see the note below about 
section 5.3 of the Unicode standard]. Thus no changes will be needed to fonts or to 
rendering engines.

To look at it another way, virtually the only action that the Unicode Consortium needs 
to take to define UNRENDERED CHARACTER is to promise never to define a character at 
that code point.

UNRENDERED CHARACTER has to be part of the BMP for backward compatibility: it should 
be renderable as a single glyph, not as a pair of glyphs, even on old systems that do 
not understand surrogates. The proposed positioning is intended to persuade older 
systems that this character should be rendered conventionally, like a Latin letter.
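The BMP requirement can be checked mechanically. The following is a minimal Python sketch, not part of the proposal itself; the helper name utf16_code_units is hypothetical. It counts how many 16-bit code units UTF-16 needs for a code point: anything in the BMP takes one, while anything beyond it takes a surrogate pair.

```python
def utf16_code_units(code_point: int) -> int:
    """Return how many 16-bit code units UTF-16 needs for one code point.

    Hypothetical helper for illustration only.
    """
    # Encode big-endian without a byte order mark, then count 2-byte units.
    encoded = chr(code_point).encode("utf-16-be")
    return len(encoded) // 2


# U+024F lies in the BMP, so it is a single code unit.
print(utf16_code_units(0x024F))   # 1
# U+10400 lies outside the BMP, so it needs a surrogate pair.
print(utf16_code_units(0x10400))  # 2
```

A system that does not understand surrogates would treat the second case as two separate (and meaningless) 16-bit values, which is the situation the BMP placement is meant to avoid.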

The nearest possible alternatives are:

U+FFFE - on at least some Windows systems, this is displayed correctly (i.e. 
identically to characters that are missing from the current font); but in the Unicode 
standard it has the explicit semantics of not being a character at all, and so ought 
not to be intentionally used as a character (a rendering engine would be within its 
rights to suppress it altogether; some application programs might report errors or 
even become confused about byte ordering).

U+FFFD - on at least some Windows systems, this is displayed correctly (i.e. 
identically to characters that are missing from the current font); but in the Unicode 
standard it has the explicit semantics of being a replacement for a character 
*unrepresentable in Unicode*. A character unrepresentable in Unicode is not the same 
as a Unicode character that happens not to have a representation in the current font. 
It is possible that a particular font may have distinctive visual representations of 
U+FFFC and U+FFFD that are distinct from the way that it draws unrendered characters.

Otto Stolz suggested U+03A2, which would be equally valid. However, U+03A2 is quite 
obviously the code for GREEK CAPITAL LETTER FINAL SIGMA. For O.S., this is a reason 
for using the code (because there is, in fact, no such letter, so the code can be 
used); for me, this is a strong reason for *not* using the code, because if it 
**ever** became necessary to encode GREEK CAPITAL LETTER FINAL SIGMA then no character 
other than U+03A2 would be acceptable, whereas U+024F has no inherent semantics at all.
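On the byte-ordering concern mentioned for U+FFFE above: the byte order mark is U+FEFF, and U+FFFE is exactly what U+FEFF looks like when its two bytes are swapped. The sketch below, in Python, shows how a naive detector could mistake a deliberately used U+FFFE at the start of big-endian text for a little-endian byte order mark. The function names are hypothetical, and real applications may use more elaborate detection.

```python
import struct

BOM = 0xFEFF  # the code point used as the byte order mark


def swap_bytes(code_unit: int) -> int:
    """Swap the high and low bytes of a 16-bit value."""
    return ((code_unit & 0xFF) << 8) | (code_unit >> 8)


def guess_byte_order(first_two_bytes: bytes) -> str:
    """Naive byte-order guess from the first 16-bit unit (illustrative only)."""
    (unit,) = struct.unpack(">H", first_two_bytes)  # read as big-endian
    if unit == BOM:
        return "big-endian"
    if unit == swap_bytes(BOM):  # 0xFFFE: a byte-swapped BOM
        return "little-endian"
    return "unknown"


# Big-endian text that starts with an intentional U+FFFE ...
data = b"\xff\xfe\x00\x41"
# ... is indistinguishable from little-endian text that starts with a BOM.
print(guess_byte_order(data[:2]))  # little-endian
```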

Section 5.3 of the Unicode standard makes a distinction between unassigned and 
unrenderable characters. Systems that make use of this distinction are an exception to 
the statement I made earlier that "every existing Unicode font already renders 
UNRENDERED CHARACTER correctly". Nevertheless, the rendering of UNRENDERED CHARACTER 
as "unassigned" rather than "unrenderable" is unlikely to cause much confusion.
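Whether a code point counts as "unassigned" can be looked up in the Unicode Character Database. The following Python sketch uses the standard unicodedata module; the helper name is_unassigned is hypothetical. Two caveats apply: the answer depends on the Unicode version bundled with the Python interpreter (a code point unassigned in one version may be assigned in a later one), and noncharacters such as U+FFFE share the same general category (Cn) as unassigned code points. The check says nothing about whether a given font can render the character.

```python
import unicodedata


def is_unassigned(code_point: int) -> bool:
    """True if the bundled Unicode database gives this code point category Cn.

    Hypothetical helper; Cn covers unassigned code points and noncharacters.
    """
    return unicodedata.category(chr(code_point)) == "Cn"


for cp in (0x0041, 0x03A2, 0xFFFD, 0xFFFE):
    status = "unassigned (Cn)" if is_unassigned(cp) else "assigned"
    print(f"U+{cp:04X}: {status}")
```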

One other exception would be a pathologically helpful font/engine that represents each 
unrendered character as a unique glyph (for example, a miniature of the character's 
hexadecimal value). This, again, would not be a problem: the user will instantly 
recognize "miniature 024F" as being different from ordinary characters and in the same 
class as the "miniature 021D" glyphs that disfigure the page.

Would it be worth submitting a proposal for UNRENDERED CHARACTER? As I said, it *is* 
adequately implemented already: the only purpose for wanting it defined in the 
standard is to prevent the implementation from being suddenly broken in the future.

- Martin Kochanski.
