On 23 Oct 2009, at 04:20, Ian Hickson wrote:
On Wed, 21 Oct 2009, Øistein E. Andersen wrote:
ASCII-compatibility:
The note in 2.1.5 Character encodings seems to say that [...]
ISO-2022[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I
cannot find anything in Section 2.1.5 that would explain this
difference.
HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
ISO-2022-* uses the control codes. That's the difference.
'~'/0x7E is not (and should not be, as far as I can tell) relevant for
HTML5's concept of ASCII compatibility.
Discouraged encodings: [...]
Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
(JIS_X0212-1990), [...]
It is not clear what this means [...]
This is talking about character encodings, not character sets.
"JIS_C6226-1983" is a registered character encoding in the IANA
registry.
(This is less confusing now since HTML5 only deals with character
encodings and the strings match those in the IANA registry as
suggested by Yui Naruse.)
the list of discouraged encodings seems conspicuously short if it is
supposed to be complete; and the lack of rationale makes it difficult
to understand why these encodings are considered particularly harmful
(JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but
two at least initially puzzling cases).
The reason for including these is to discourage encodings known to
have security issues. I've added HZ-GB-2312, which can be used in a
similarly dangerous fashion. (Basically the danger for user agents is
in an attacker using an encoding that a user agent could autodetect,
while a site interprets the bytes safely; that would allow those
encodings to be used to smuggle <script> elements in a way that a
naive whitelisting filter would think is safe.)
It might be better to say *why* particular encodings are better
avoided, whether or not the list of discouraged encodings be presented
as definitive.
I've added a note.
[...]
On Thu, 22 Oct 2009, Philip Taylor wrote:
The string "[숍訊昱穿]" encoded as ISO-2022-KR is the bytes 0e 3c 73
63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome,
when I last checked) will decode it as Windows-1252 and get the string
"<script>", which is bad. So a site that uses ISO-2022-KR is very
likely to expose some users to XSS attacks, which seems like a good
reason to discourage that encoding. The same applies to other ISO-2022
encodings.
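
The fallback half of this example can be reproduced with a minimal
sketch. It is not part of the original thread: Python's built-in
"cp1252" codec is assumed to stand in for a UA's Windows-1252
fallback, and the helper name decode_as_fallback is hypothetical.

```python
# The byte sequence quoted in the message above.
payload = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")


def decode_as_fallback(data: bytes) -> str:
    """Decode bytes the way a UA without ISO-2022-KR support might,
    assuming it falls back to Windows-1252 (an assumption of this sketch)."""
    return data.decode("cp1252")


fallback_text = decode_as_fallback(payload)

# The leading 0x0E (Shift Out) survives as a control character; the
# remaining bytes are read as plain ASCII markup.
print(repr(fallback_text))
print("<script>" in fallback_text)
```

The 0x0E byte is the shift that an ISO-2022-KR decoder would act on; a
decoder that ignores it treats everything after it as ordinary ASCII.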
[...]
On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
If that is the reason, at least HZ encoding would seem to be affected
as well. Explicitly discouraging a more or less random subset of the
problematic encodings without providing rationale makes it difficult
to assess whether or not other, somewhat similar, encodings should be
avoided as well, which was the main issue I wanted to raise.
Hopefully this is somewhat addressed now.
The added note certainly helps, but it is vague (does "[m]ost of these
encodings" mean "all the encodings mentioned above apart from
UTF-32"?) and inaccurate (Philip Taylor's example does not rely on
"bugs").
Given that the set of encodings is open-ended, I still think it would
be preferable to make the rationale (a definition of what makes an
encoding problematic) primary and mention actual encodings as
examples. This could give something like the following: "Encodings in
which a series of bytes in the range 0x20..0x7E may encode characters
other than the corresponding characters in the range U+20..U+7E
represent a potential security vulnerability since a browser that does
not support the encoding (or does not support the label used to
declare the encoding, or does not use the same mechanism to detect the
encoding of unlabelled content) might end up interpreting technically
benign plain text content as HTML tags and JavaScript. In particular,
this applies to encodings in which the bytes corresponding to
'<script>' in ASCII may encode a different string. Authors should not
use such encodings, which are known to include.... In addition,
authors should not use UTF-32 ...." Alternatively, fixing the current
note would help and might be sufficient, albeit not ideal.
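
The criterion proposed above can be approximated for a single sample
string with a rough sketch. It is an illustration only: the function
name reuses_printable_ascii is hypothetical, Python's built-in "hz"
and "utf-8" codecs are assumed, and a check on one sample cannot prove
that an encoding is safe in general.

```python
def reuses_printable_ascii(text: str, encoding: str) -> bool:
    """Return True if encoding `text` yields bytes in 0x20..0x7E that do
    not correspond one-to-one to the U+20..U+7E characters of `text`."""
    encoded = text.encode(encoding)
    # Characters a decoder unaware of the encoding would see as ASCII.
    seen_as_ascii = [chr(b) for b in encoded if 0x20 <= b <= 0x7E]
    # Printable ASCII characters actually present in the source text.
    really_ascii = [ch for ch in text if 0x20 <= ord(ch) <= 0x7E]
    return seen_as_ascii != really_ascii


sample = "a中b"
for name in ("utf-8", "hz"):
    print(name, reuses_printable_ascii(sample, name))
```

UTF-8 encodes the non-ASCII character with bytes above 0x7F only, so
the check does not fire; HZ represents it with bytes in the printable
ASCII range, which is the property the proposed wording describes.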
I think one has to realise that a comprehensive list of problematic
encodings is an elusive goal and act accordingly.
--
Øistein E. Andersen
PS: The following sentence makes little sense without (curly) quotes
and apostrophes. In case they disappeared before you read it, please
find it repeated below with (ASCII) quotes and apostrophes:
It should probably be "advise against authors' using legacy
encodings"
or better "advise authors against using legacy encodings".
(The current text in the spec is fine.)