On 23 Oct 2009, at 04:20, Ian Hickson wrote:

On Wed, 21 Oct 2009, Øistein E. Andersen wrote:


ASCII-compatibility:
The note in ‘2.1.5 Character encodings’ seems to say that [...]
ISO-2022[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot
find anything in Section 2.1.5 that would explain this difference.

HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
ISO-2022-* uses the control codes. That's the difference.

'~'/0x7E is not (and should not be, as far as I can tell) relevant for HTML5's concept of ASCII compatibility.

Discouraged encodings: [...]

Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
(JIS_X0212-1990), [...]

It is not clear what this means [...]

This is talking about character encodings, not character sets.
"JIS_C6226-1983" is a registered character encoding in the IANA registry.

(This is less confusing now since HTML5 only deals with character encodings and the strings match those in the IANA registry as suggested by Yui Naruse.)

the list of discouraged encodings seems conspicuously short if it is
supposed to be complete; and the lack of rationale makes it difficult to
understand why these encodings are considered particularly harmful
(JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two
at least initially puzzling cases).

The reason for including these is to discourage encodings known to have security issues. I've added HZ-GB-2312, which can be used in a similarly dangerous fashion. (Basically the danger for user agents is in an attacker
using an encoding that a user agent could autodetect, while a site
interprets the bytes safely; that would allow those encodings to be used
to smuggle <script> elements in a way that a naive whitelisting filter
would think is safe.)

It might be better to say *why* particular encodings are better avoided,
whether or not the list of discouraged encodings be presented as
definitive.

I've added a note.

[...]

On Thu, 22 Oct 2009, Philip Taylor wrote:

The string "[숍訊昱穿]" encoded as ISO-2022-KR is the bytes 0e 3c 73 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome, when I last checked) will decode it as Windows-1252 and get the string "<script>", which is bad. So a site that uses ISO-2022-KR is very likely to expose some users to XSS attacks, which seems like a good reason to discourage that encoding. The same applies to other ISO-2022 encodings.
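A minimal sketch of the fallback described above, assuming the user agent does not recognise ISO-2022-KR and decodes the bytes as Windows-1252 instead. The variable names are illustrative only; the byte sequence is the one quoted in the message.

```python
# Bytes quoted in the message above: SO (0x0E) followed by seven-bit data.
payload = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")

# Assumption: a UA without ISO-2022-KR support falls back to Windows-1252.
fallback_text = payload.decode("cp1252")

# The SO control byte survives as U+000E; every other byte reads as ASCII.
visible = fallback_text.lstrip("\x0e")
print(repr(visible))
```

Under that assumption, the remaining characters are exactly the ASCII reading of bytes 3c through 3e, which is why text that is harmless in one encoding can turn into markup in another.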

[...]

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:

If that is the reason, at least HZ encoding would seem to be affected as
well. Explicitly discouraging a more or less random subset of the
problematic encodings without providing rationale makes it difficult to
assess whether or not other, somewhat similar, encodings should be
avoided as well, which was the main issue I wanted to raise.

Hopefully this is somewhat addressed now.


The added note certainly helps, but it is vague (does "[m]ost of these encodings" mean "all the encodings mentioned above apart from UTF-32"?) and inaccurate (Philip Taylor's example does not rely on "bugs").

Given that the set of encodings is open-ended, I still think it would be preferable to make the rationale (a definition of what makes an encoding problematic) primary and mention actual encodings as examples. This could give something like the following: "Encodings in which a series of bytes in the range 0x20..0x7E may encode characters other than the corresponding characters in the range U+20..U+7E represent a potential security vulnerability since a browser that does not support the encoding (or does not support the label used to declare the encoding, or does not use the same mechanism to detect the encoding of unlabelled content) might end up interpreting technically benign plain text content as HTML tags and JavaScript. In particular, this applies to encodings in which the bytes corresponding to '<script>' in ASCII may encode a different string. Authors should not use such encodings, which are known to include.... In addition, authors should not use UTF-32 ...." Alternatively, fixing the current note would help and might be sufficient, albeit not ideal.
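The proposed definition can be tried out on a given byte string, as in the rough sketch below. It assumes Python's built-in codecs ("hz", "utf-8", "cp1252") as stand-ins for a decoder that supports the encoding; the helper name reinterprets_printable_ascii is hypothetical, and the check only covers the one byte string it is given, not an encoding as a whole.

```python
def reinterprets_printable_ascii(encoding: str, data: bytes) -> bool:
    """Return True if `data`, made only of bytes 0x20..0x7E, decodes in
    `encoding` to something other than its plain ASCII reading."""
    if not all(0x20 <= b <= 0x7E for b in data):
        raise ValueError("data must contain only bytes in the range 0x20..0x7E")
    return data.decode(encoding, errors="replace") != data.decode("ascii")


# In HZ, "~{" switches to two-byte GB mode and "~}" switches back, so the
# eight bytes of "<script>" in between are read as byte pairs, not as ASCII.
hz_payload = b"~{<script>~}"

for name in ("hz", "utf-8", "cp1252"):
    print(name, reinterprets_printable_ascii(name, hz_payload))
```

A decoder that supports HZ sees no "<script>" in this input, while one that treats the same bytes as UTF-8 or Windows-1252 does; that mismatch is what the suggested wording tries to capture.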

I think one has to realise that a comprehensive list of problematic encodings is an elusive goal and act accordingly.

--
Øistein E. Andersen


PS: The following sentence makes little sense without (curly) quotes and apostrophes. In case they disappeared before you read it, please find it repeated below with (ASCII) quotes and apostrophes:

It should probably be "advise against authors' using legacy encodings"
or better "advise authors against using legacy encodings".

(The current text in the spec is fine.)
