2014-06-29 21:44, Koji Ishii wrote:

The spec currently has the following text[2]:

Control characters (Unicode class Cc) other than tab (U+0009), line
feed (U+000A), and carriage return (U+000D) are ignored for the
purpose of rendering. (As required by [UNICODE], unsupported
Default_ignorable characters must also be ignored for rendering.)

and there’s a feedback saying that CSS should display visible glyphs
for these control characters.

That would change the identity of the characters. They are by definition “control characters”, i.e. they have no visible glyphs, but they may have control effects. However, it might be argued that rendering them somehow would not mean normal rendering but be a diagnostic indication of an error. Those characters are invalid in HTML and XML (except XML 1.1, but who uses it?).

However, the tradition of web browsers is permissive in order to be user-friendly. E.g., a casual control character somewhere might be interesting to a *developer* or maintainer to notice, so that he could analyze and fix the problem that caused it, but to a *user* (visitor), it would mostly be just disturbing. He can’t fix the problem, and it is mostly useless to him to see that the page has some control character in the source. So *developer tools* should indicate such problems or provide ways to detect them, but it seems correct to ignore them in normal rendering.

Since all major browsers do not display
them today, this is a breaking-change

Well, I would not take that as a strong argument. This would be a change in error processing. But it would not be useful, for the reasons given above.

I found the following text in Unicode 6.3, p. 185, “5.21 Ignoring
Characters in Processing”[3]:

Surrogate code points, private-use characters, and control
characters are not given the Default_Ignorable_Code_Point property.
To avoid security problems, such characters or code points, when
not interpreted and not displayable by normal rendering, should be
displayed in fallback rendering with a fallback glyph

By looking at this, my questions are as follows:

1. Should control characters that browsers do not interpret be
displayed in fallback rendering?

It is reasonable to take the view that there are no such control characters, because all control characters except those with special handling are interpreted as being invalid data and are therefore ignored.
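As a rough illustration of the rule quoted from the spec (class Cc is ignored except for tab, line feed, and carriage return), here is a minimal Python sketch. The function name is made up for this example, and it only filters a string; it does not claim to reflect how any browser implements rendering.

```python
import unicodedata

# Tab, line feed, and carriage return keep their usual handling.
PRESERVED = {"\t", "\n", "\r"}

def strip_ignored_controls(text: str) -> str:
    """Drop general category Cc characters other than tab, LF, and CR."""
    return "".join(
        ch for ch in text
        if ch in PRESERVED or unicodedata.category(ch) != "Cc"
    )
```

Format characters such as U+200B (category Cf) are not in class Cc, so this sketch leaves them alone.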

2. Should private-use characters
(U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be
displayed in fallback rendering?

They might be seen as “not displayable by normal rendering”, so yes. On the practical side, although Private Use characters should not be used in public information interchange, they are increasingly popular in “icon font” tricks. Whatever we think of such tricks, users should not be punished for them. If the trick fails (usually because a page uses a downloadable font for icon glyphs allocated to Private Use codepoints but something prevents the use of such a font), it is relevant to the user to know that there is *some* data, which can be crucial (e.g., an item in a navigation menu). So some dull fallback rendering is probably better than simply ignoring the characters.
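To make the idea concrete, here is a small Python sketch that tests a character against the three Private Use ranges listed in the question and substitutes a placeholder when no glyph is available. The `has_glyph` callback and the choice of U+FFFD as the placeholder are assumptions made for illustration only; a real renderer would consult its fonts and draw whatever fallback glyph it prefers.

```python
# Private Use ranges as listed in the question above.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)

FALLBACK = "\uFFFD"  # assumed placeholder, standing in for a fallback glyph

def is_private_use(ch: str) -> bool:
    """True if the character lies in one of the Private Use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)

def render_with_fallback(text, has_glyph):
    """Replace Private Use characters lacking a glyph with the placeholder.

    has_glyph is a hypothetical predicate: it takes one character and
    returns True if the current font can display it.
    """
    return "".join(
        FALLBACK if is_private_use(ch) and not has_glyph(ch) else ch
        for ch in text
    )
```

For example, if the icon font failed to load, `render_with_fallback("Home \ue001", lambda ch: False)` keeps "Home " and shows a placeholder where the icon would have been, so the user can see that something is there.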

3. When the above text says “surrogate code points”, does that mean
everything outside BMP?

No, it means code points that do not represent *any* characters due to being in certain special areas in the coding space. They are invalid in HTML and in XML. If they appear in data, the reason is usually that UTF-16 encoded data containing non-BMP characters is being processed in a wrong way. At the level of interpreting a byte stream as a stream of characters, surrogate code *units* in UTF-16 should be processed and interpreted in pairs so that one pair is taken as one character. And when CSS gets at it, it only sees the character in the DOM.

It is adequate to ignore surrogate code points, since they are invalid and signalling them to users (as opposed to developers) would hardly do any good.
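The distinction between surrogate code units and non-BMP characters can be sketched in Python as follows. This is the standard UTF-16 pairing arithmetic; the function names are made up for this example.

```python
def is_surrogate(cp: int) -> bool:
    """True for code points in the surrogate area U+D800..U+DFFF."""
    return 0xD800 <= cp <= 0xDFFF

def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

A correctly processed pair such as 0xD83D, 0xDE00 yields the single non-BMP code point U+1F600, which is not itself a surrogate. A lone 0xD83D left over from faulty processing is a surrogate code point and represents no character.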

4. Should every code point that are not
given the Default_Ignorable_Code_Point property and that without
interpretations nor glyphs displayed in fallback rendering? I could
not find such statement in Unicode spec, but there are some people
who believe so.

5. Is there anything else Unicode recommends to
display in fallback rendering, or not to display? This must be RTFM,
but pointing out where to read would be appreciated.

From the Unicode point of view, an implementation may decide what characters it supports. What it does to characters that it does not support seems to be generally up to the implementation to decide as regards rendering. Here, too, I would consider the practical impact on users. If a page contains characters that have no glyphs in the fonts that are used, then the page has data that is probably valid but cannot be rendered in a particular situation. Showing some indication of this is relevant, because the user knows he is missing something real, and he might be able to fix the situation in various ways (e.g., changing browser settings, downloading and installing extra fonts, or just switching to a different browser – browsers are known to differ in their abilities to use the fonts installed in a system).

Yucca


_______________________________________________
Unicode mailing list
[email protected]
http://unicode.org/mailman/listinfo/unicode
