Re: [whatwg] ISO-8859-* and the C1 control range

Maciej Stachowiak Tue, 05 Jun 2007 10:00:30 -0700


On Jun 5, 2007, at 12:18 AM, Henri Sivonen wrote:

On May 29, 2007, at 13:13, Henri Sivonen wrote:
To avoid stepping on the toes of Charmod more than is necessary, I suggest making it non-conforming for a document to have bytes in the 0x80…0x9F range when the character encoding is declared to be one of the ISO-8859 family encodings.
I've been thinking about this. I have a proposal on how to spec this *conceptually* and how to implement this with error reporting. I am assuming here that 1) No one ever intends C1 code points to be present in the decoded stream and 2) we want, as a Charmod correctness fig leaf, to make the C1 bytes non-conforming when ISO-8859-1 or ISO-8859-11 was declared but Windows-1252 or Windows-874 decoding is needed.
Based on the behavior of Minefield and Opera 9.20, the following seems to be the least Charmod violating and least quirky approach that could possibly work:
1) Decode the byte stream using a decoder for whatever encoding was declared, even ISO-8859-1 or ISO-8859-11, according to ftp://ftp.unicode.org/Public/MAPPINGS/.
2) If a character in the decoded character stream is in the C1 code point range, this is a document conformance violation.
2a) If the declared encoding was ISO-8859-1, replace that character with the character that you get by casting the code point into a byte and decoding it as Windows-1252.
2b) If the declared encoding was ISO-8859-11, replace that character with the character that you get by casting the code point into a byte and decoding it as Windows-874.
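
The steps above can be sketched roughly as follows. This is only an illustration in Python, not what any browser ships: the function name `decode_with_c1_repair` and the `C1_FALLBACK` table are hypothetical, and Python's `cp1252` and `cp874` codecs stand in for Windows-1252 and Windows-874. Python's tables leave a few bytes unassigned (0x81 in `cp1252`, for example); the sketch assumes the C1 character is simply kept in that case, which the proposal itself does not specify.

```python
import codecs

# Hypothetical table: declared encoding -> Windows code page used to
# reinterpret C1 code points (steps 2a and 2b).
C1_FALLBACK = {
    codecs.lookup("iso-8859-1").name: "cp1252",
    codecs.lookup("iso-8859-11").name: "cp874",
}


def decode_with_c1_repair(data, declared):
    """Return (text, violations) where violations lists the character
    offsets at which a C1 code point appeared in the decoded stream."""
    fallback = C1_FALLBACK.get(codecs.lookup(declared).name)
    out = []
    violations = []
    # Step 1: decode with the declared encoding.
    for index, ch in enumerate(data.decode(declared)):
        # Step 2: any C1 code point is a conformance violation.
        if 0x80 <= ord(ch) <= 0x9F:
            violations.append(index)
            if fallback is not None:
                try:
                    # Steps 2a/2b: cast the code point to a byte and
                    # decode it with the Windows code page.
                    ch = bytes([ord(ch)]).decode(fallback)
                except UnicodeDecodeError:
                    # Assumption: byte unassigned in Python's table,
                    # so the C1 character is left unchanged.
                    pass
        out.append(ch)
    return "".join(out), violations
```

For instance, `decode_with_c1_repair(b"\x93hi\x94", "ISO-8859-1")` yields curly quotes around `hi` and reports offsets 0 and 3 as violations.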
[
The *simplest* and most robust (and maximally Charmod-violating) thing would be:
1) Decode the byte stream using a decoder for whatever encoding was declared, even ISO-8859-1 or ISO-8859-11, according to ftp://ftp.unicode.org/Public/MAPPINGS/.
2) If a character in the decoded character stream is in the C1 code point range, this is a document conformance violation. Replace that character with the character that you get by casting the code point into a byte and decoding it as Windows-1252.
But this isn't what Minefield, Opera 9.20 and WebKit nightlies do.
]
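
The bracketed "simplest" variant, where every C1 code point is reinterpreted as Windows-1252 regardless of the declared encoding, lends itself to a precomputed translation table. The sketch below is an assumption-laden illustration: `build_c1_table` and `decode_simple` are hypothetical names, Python's `cp1252` codec stands in for Windows-1252, and bytes that Python's table leaves unassigned are left as C1 characters.

```python
def build_c1_table():
    """Map each C1 code point to its Windows-1252 reading, where one exists."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # Unassigned in Python's cp1252 table: keep the C1 character.
            continue
    return table


C1_TO_WINDOWS_1252 = build_c1_table()


def decode_simple(data, declared):
    """Return (text, has_violation) for the declared encoding."""
    text = data.decode(declared)
    # Any C1 code point in the decoded stream is a conformance violation.
    has_violation = any(0x80 <= ord(ch) <= 0x9F for ch in text)
    # Replace C1 code points using the precomputed table.
    return text.translate(C1_TO_WINDOWS_1252), has_violation
```

Because the table does not depend on the declared encoding, a document declared as ISO-8859-11 gets the Windows-1252 reading of a C1 byte here, which is the difference from steps 2a and 2b.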

What we actually do in WebKit is always use a windows-1252 decoder when ISO-8859-1 is requested. I don't think it's very helpful to make all documents that declare an ISO-8859-1 encoding and use characters in the C1 range nonconforming. It's true that they are counting on nonstandard processing of the nominally declared encoding, but I don't think that causes a problem in practice, as long as the rule is well known. It seems simpler to just make latin1 an alias for winlatin1.
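
As a rough illustration of the alias approach (not WebKit's actual code), the label can be swapped before any decoding happens. The names `LABEL_ALIASES` and `decode_with_alias` are hypothetical, Python's `cp1252` codec stands in for windows-1252, and using `errors="replace"` for bytes that Python's table leaves unassigned is an assumption of this sketch.

```python
import codecs

# Hypothetical alias table: latin1 is treated as winlatin1.
LABEL_ALIASES = {codecs.lookup("iso-8859-1").name: "cp1252"}


def decode_with_alias(data, declared):
    """Decode data, substituting cp1252 whenever ISO-8859-1 is declared."""
    canonical = codecs.lookup(declared).name
    target = LABEL_ALIASES.get(canonical, declared)
    # Assumption: bytes unassigned in the target table become U+FFFD.
    return data.decode(target, errors="replace")
```

With this, `decode_with_alias(b"\x93x\x94", "latin1")` gives curly quotes around `x` with no separate repair pass, while other declared encodings are decoded as before.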


Regards,
Maciej
