Re: [whatwg] Superset encodings [Re: ISO-8859-* and the C1 control range]

Øistein E. Andersen Fri, 23 Oct 2009 13:22:31 -0700

On 23 Oct 2009, at 04:20, Ian Hickson wrote:

On Wed, 21 Oct 2009, Øistein E. Andersen wrote:

ASCII-compatibility:
The note in 2.1.5 Character encodings seems to say that [...]
ISO-2022[-*] are ASCII-compatible, whereas HZ-GB-2312 is not, and I cannot
find anything in Section 2.1.5 that would explain this difference.


HZ-GB-2312 uses the byte ASCII uses for "~" as the escape character.
ISO-2022-* uses the control codes. That's the difference.

'~'/0x7E is not (and should not be, as far as I can tell) relevant for HTML5's concept of ASCII compatibility.

Discouraged encodings: [...]

Authors should not use JIS-X-0208 (JIS_C6226-1983), JIS-X-0212
(JIS_X0212-1990), [...]


It is not clear what this means [...]


This is talking about character encodings, not character sets.

"JIS_C6226-1983" is a registered character encoding in the IANA registry.

(This is less confusing now since HTML5 only deals with character encodings and the strings match those in the IANA registry as suggested by Yui Naruse.)

the list of discouraged encodings seems conspicuously short if it is
supposed to be complete; and the lack of rationale makes it difficult to
understand why these encodings are considered particularly harmful
(JIS_C6226-1983 v. JIS_C6226-1978 or ISO-2022 v. HZ, to mention but two
at least initially puzzling cases).

The reason for including these is to discourage encodings known to have
security issues. I've added HZ-GB-2312, which can be used in a similarly
dangerous fashion. (Basically the danger for user agents is in an attacker
using an encoding that a user agent could autodetect, while a site
interprets the bytes safely; that would allow those encodings to be used
to smuggle <script> elements in a way that a naive whitelisting filter
would think is safe.)

It might be better to say *why* particular encodings are better avoided,
whether or not the list of discouraged encodings be presented as
definitive.

I've added a note.

[...]

On Thu, 22 Oct 2009, Philip Taylor wrote:
The string "[숍訊昱穿]" encoded as ISO-2022-KR is the bytes 0e 3c 73 63 72 69 70 74 3e. A UA that doesn't support ISO-2022-KR (e.g. Chrome, when I last checked) will decode it as Windows-1252 and get the string "<script>", which is bad. So a site that uses ISO-2022-KR is very likely to expose some users to XSS attacks, which seems like a good reason to discourage that encoding. The same applies to other ISO-2022 encodings.
[...]
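
To make Philip Taylor's example concrete, here is a minimal Python sketch of the two readings of those bytes. It is an illustration under stated assumptions, not how any particular browser or site is implemented: the ISO-2022-KR reading is approximated by setting the high bit on the bytes after SO (0x0E) and decoding them as EUC-KR, which addresses the same KS X 1001 code points, and `naive_filter_allows` is a hypothetical stand-in for a whitelisting filter.

```python
# Bytes from the example: SO (0x0E) followed by eight printable ASCII bytes.
payload = bytes.fromhex("0e 3c 73 63 72 69 70 74 3e")

# Reading 1: a UA that does not support ISO-2022-KR and falls back to
# Windows-1252 sees a control character followed by markup.
as_windows_1252 = payload.decode("cp1252")

# Reading 2 (assumption: approximates ISO-2022-KR after the SO shift):
# the 7-bit pairs are KS X 1001 code points, which EUC-KR encodes with
# the high bit set on each byte.
as_korean = bytes(b | 0x80 for b in payload[1:]).decode("euc_kr")


def naive_filter_allows(text: str) -> bool:
    """Hypothetical whitelisting filter: accept text that contains no '<'."""
    return "<" not in text


# The site filters the text it believes it has; the UA renders something else.
print(naive_filter_allows(as_korean))        # filter sees only Korean text
print(naive_filter_allows(as_windows_1252))  # fallback reading contains markup
```

The point of the sketch is only that the same byte sequence passes a filter under one decoding and contains "<script>" under another.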

On Thu, 22 Oct 2009, Øistein E. Andersen wrote:
If that is the reason, at least HZ encoding would seem to be affected as
well. Explicitly discouraging a more or less random subset of the
problematic encodings without providing rationale makes it difficult to
assess whether or not other, somewhat similar, encodings should be
avoided as well, which was the main issue I wanted to raise.

Hopefully this is somewhat addressed now.

The added note certainly helps, but it is vague (does "[m]ost of these encodings" mean "all the encodings mentioned above apart from UTF-32"?) and inaccurate (Philip Taylor's example does not rely on "bugs").

Given that the set of encodings is open-ended, I still think it would be preferable to make the rationale (a definition of what makes an encoding problematic) primary and mention actual encodings as examples. This could give something like the following: "Encodings in which a series of bytes in the range 0x20..0x7E may encode characters other than the corresponding characters in the range U+20..U+7E represent a potential security vulnerability since a browser that does not support the encoding (or does not support the label used to declare the encoding, or does not use the same mechanism to detect the encoding of unlabelled content) might end up interpreting technically benign plain text content as HTML tags and JavaScript. In particular, this applies to encodings in which the bytes corresponding to '<script>' in ASCII may encode a different string. Authors should not use such encodings, which are known to include.... In addition, authors should not use UTF-32 ...." Alternatively, fixing the current note would help and might be sufficient, albeit not ideal.
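
The proposed definition can be checked against a single byte sequence, as in the following Python sketch. The function name `is_problematic_sample` and the sample bytes are illustrative assumptions. The check only shows that one given sequence of bytes in 0x20..0x7E decodes to something other than its ASCII reading. It does not prove that an encoding is safe, since stateful encodings may need a particular escape or shift sequence to trigger the behaviour.

```python
def printable_ascii_only(data: bytes) -> bool:
    """True if every byte lies in the range 0x20..0x7E."""
    return all(0x20 <= b <= 0x7E for b in data)


def is_problematic_sample(data: bytes, encoding: str) -> bool:
    """True if `data` uses only bytes 0x20..0x7E yet decodes, in `encoding`,
    to a string different from its plain ASCII reading."""
    if not printable_ascii_only(data):
        return False
    return data.decode(encoding) != data.decode("ascii")


# HZ switches to GB 2312 with "~{" and back with "~}", all printable ASCII,
# so the bytes of "<script>" in between decode to Chinese characters.
hz_sample = b"~{<script>~}"

print(is_problematic_sample(hz_sample, "hz"))        # True
print(is_problematic_sample(b"<script>", "utf-8"))   # False
```

Under this check HZ is flagged directly, because its escape character is itself in the range 0x20..0x7E. The ISO-2022 encodings need a control code outside that range to change state, so the prefix has to be supplied separately for them.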

I think one has to realise that a comprehensive list of problematic encodings is an elusive goal and act accordingly.


--
Øistein E. Andersen

PS: The following sentence makes little sense without (curly) quotes and apostrophes. In case they disappeared before you read it, please find it repeated below with (ASCII) quotes and apostrophes:

It should probably be "advise against authors' using legacy encodings"
or better "advise authors against using legacy encodings".


(The current text in the spec is fine.)
