On Mar 10, 2006, at 22:49, Ian Hickson wrote:

I'm actually considering just requiring that UAs support rewinding (by
defining the exact semantics of how to parse for the <meta> header). Is
this something people would object to?

I think allowing in-place decoder change (when feasible) would be good for performance.

I think it would be beneficial to additionally stipulate that
1. The meta element-based character encoding information declaration is expected to work only if the Basic Latin range of characters maps to the same
bytes as in the US-ASCII encoding.

Is this realistic? I'm not really familiar enough with character encodings
to say if this is what happens in general.

I suppose it is realistic. See below.

2. If there is no external character encoding information nor a BOM (see
below), there MUST NOT be any non-ASCII bytes in the document byte
stream before the end of the meta element that declares the character
encoding. (In practice this would ban unescaped non-ASCII class names on the html and head elements and non-ASCII comments at the beginning of
the document.)

Again, can we realistically require this? I need to do some studies of
non-latin pages, I guess.

As UA behavior, no. As a conformance requirement, maybe.
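
For illustration, a conformance checker could implement that requirement with a trivial byte scan over the stream prefix. This is just a sketch of the idea; the function name and the notion of a pre-gathered prefix are mine:

# `prefix` is the byte stream up to and including the end of the
# meta element that declares the character encoding. Any byte
# outside the ASCII range in that prefix would be non-conforming.
def non_ascii_offsets(prefix):
    return [i for i, b in enumerate(prefix) if b > 0x7F]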

Authors should avoid including inline character encoding information.
Character encoding information should instead be included at the
transport level (e.g. using the HTTP Content-Type header).

I disagree.

With HTML and contemporary UAs, there is no real harm in including the character encoding information both on the HTTP level and in the meta element as
long as the information is not contradictory. On the contrary, the
author-provided internal information is actually useful when end users
save pages to disk using UAs that do not reserialize with internal
character encoding information.

...and it breaks everything when you have a transcoding proxy, or similar.

Well, not until you save to disk, since HTTP takes precedence. However, authors can escape this by using UTF-8. (Assuming here that tampering with UTF-8 would be harmful, wrong and pointless.)

Interestingly, transcoding proxies tend to be brought up by residents of Western Europe, North America or the Commonwealth. I have never seen a Russian person living in Russia or a Japanese person living in Japan talk about transcoding proxies in any online or offline discussion. That's why I doubt the importance of transcoding proxies.

FWIW, I think Opera Mini is a distributed UA--not a proxy and a UA.

Character encoding information shouldn't be duplicated, IMHO, that's just
asking for trouble.

I suggest a mismatch be considered an easy parse error and, therefore, reportable.

For HTML, user agents must use the following algorithm in determining the
character encoding of a document:
1. If the transport layer specifies an encoding, use that.

Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32
makes no practical sense for interchange on the Web.)

I don't know, should there?

I believe there should.

2. Otherwise, if the user agent can find a meta element that specifies
character encoding information (as described above), then use that.

If a conformance checker has not determined the character encoding by
now, what should it do? Should it report the document as non-conforming
(my preferred choice)? Should it default to US-ASCII and report any
non-ASCII bytes as conformance errors? Should it continue to the fuzzier
steps like browsers would (hopefully not)?

Again, I don't know.

I'll continue to treat such documents as non-conforming, then.

Currently the behaviour is very underspecified here:

   http://whatwg.org/specs/web-apps/current-work/#documentEncoding

I'd like to rewrite that bit. It will require a lot of research; of
existing authoring practices, of current UAs, and of author needs. If
anyone wants to step up and do the work, I'd be very happy to work with
them and get something sorted out here.

Disclaimer: This is not based on reading the Gecko or WebKit source. Instead, this is based on quick research into character encodings and on black-box testing of Firefox 1.5, Opera 9.0 preview and Safari 2.0.3. Tests: http://hsivonen.iki.fi/test/wa10/encoding-detection/ (c- means that I think it should be a conforming case and nc- means that I think it should be a non-conforming case.)

It turns out that most character encodings have the property that in the initial state of the decoder the bytes 0x20–0x7E (inclusive) as well as 0x09, 0x0A and 0x0D decode to the Unicode code points of the same (zero-extended) value. Character encodings that have this property (hereafter "rough ASCII superset"; a programmatic check is sketched after the list) include:
Big5
Big5-HKSCS
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM00858
IBM437
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM865
IBM866
IBM868
IBM869
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-10
ISO-8859-13
ISO-8859-14
ISO-8859-15
ISO-8859-16
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
KOI8-R
KOI8-U
MacRoman
Shift_JIS
TIS-620
US-ASCII
UTF-8
VISCII
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-ARMSCII
x-Big5-Solaris
x-EUC-TW
x-IBM1006
x-IBM1046
x-IBM1098
x-IBM1124
x-IBM1381
x-IBM1383
x-IBM737
x-IBM856
x-IBM874
x-IBM921
x-IBM922
x-IBM942C
x-IBM943C
x-IBM948
x-IBM949C
x-IBM950
x-IBM970
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-JISAutoDetect
x-Johab
x-MS950-HKSCS
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRomania
x-MacThai
x-MacTurkish
x-MacUkraine
x-PCK
x-euc-jp-linux
x-eucJP-Open
x-iso-8859-11
x-iso-8859-12
x-mswin-936
x-windows-874
x-windows-949
x-windows-950
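
For concreteness, the property can be probed mechanically. Here is a minimal Python sketch; the helper name is mine, and the labels Python's codec registry accepts do not line up exactly with the IANA names above:

# Probe bytes: 0x09, 0x0A, 0x0D and the range 0x20-0x7E inclusive.
PROBE = bytes([0x09, 0x0A, 0x0D]) + bytes(range(0x20, 0x7F))

def is_rough_ascii_superset(encoding):
    # True if, in the decoder's initial state, every probe byte
    # decodes to the Unicode code point of the same value.
    try:
        return PROBE.decode(encoding) == PROBE.decode("us-ascii")
    except (LookupError, UnicodeDecodeError):
        return False

# For example, is_rough_ascii_superset("shift_jis") is True while
# is_rough_ascii_superset("utf-16") is False.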

Notably, character encodings that I am aware of that lack this property are: JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat and x-MacSymbol, UTF-7, UTF-16 and UTF-32.

The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web pages. After browsing the encoding menus of Firefox, Opera and Safari, I'm pretty confident that the legacy IBM codepages are irrelevant as well.

I suggest the following algorithm as a starting point. It does not handle UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208. (A rough code sketch of the BOM sniffing, the null-byte patterns and the meta prescan follows the algorithm.)

- -

Set the REWIND flag to unraised.

Read the first four bytes of the byte stream.

If the bytes constitute a big-endian UTF-32 BOM, set the character encoding to big-endian UTF-32 and initialize the corresponding decoder. The detection algorithm terminates.

If the bytes constitute a little-endian UTF-32 BOM, set the character encoding to little-endian UTF-32 and initialize the corresponding decoder. The detection algorithm terminates.

If the first two bytes constitute a big-endian UTF-16 BOM, set the character encoding to big-endian UTF-16, unread the third and fourth byte and initialize the corresponding decoder. The detection algorithm terminates.

If the first two bytes constitute a little-endian UTF-16 BOM, set the character encoding to little-endian UTF-16, unread the third and fourth byte and initialize the corresponding decoder. The detection algorithm terminates.

If the first three bytes constitute a UTF-8 BOM, set the character encoding to UTF-8, unread the fourth byte and initialize the corresponding decoder. The detection algorithm terminates.

If the bytes have the pattern 0x00, 0x00, 0x00, 0x00, emit a hard parse error, unread the bytes and perform implementation-specific heuristics. Set the character encoding to the output of the heuristics. The detection algorithm terminates. (Note: need more testing here.)

If the bytes have the pattern 0x00, 0x00, 0x00, NOT-0x00, set the character encoding to UTF-32BE, emit an easy parse error, unread the bytes and initialize the corresponding decoder. The detection algorithm terminates. (Note: need more testing here.)

If the bytes have the pattern NOT-0x00, 0x00, 0x00, 0x00, set the character encoding to UTF-32LE, emit an easy parse error, unread the bytes and initialize the corresponding decoder. The detection algorithm terminates. (Note: need more testing here.)

If the first two bytes have the pattern 0x00, NOT-0x00, set the character encoding to UTF-16BE, emit an easy parse error, unread the bytes and initialize the corresponding decoder. The detection algorithm terminates. (Note: need more testing here.)

If the first two bytes have the pattern NOT-0x00, 0x00, set the character encoding to UTF-16LE, emit an easy parse error, unread the bytes and initialize the corresponding decoder. The detection algorithm terminates. (Note: need more testing here.)

Initialize a character decoder such that the bytes 0x20–0x7E (inclusive) as well as 0x09, 0x0A and 0x0D decode to the Unicode code points of the same (zero-extended) value and all other bytes decode to U+FFFD, raising the REWIND flag and emitting an easy parse error when doing so. If the UA supports in-place decoder switching (see below), the decoder should not buffer and should only consume one byte of the byte stream when one character is read from the decoder.

Start the HTML parser but do not execute scripts.

If a script start tag is seen and the UA supports scripting, raise the REWIND flag and emit an easy parse error.

If a start tag other than html or head is seen, emit an easy parse error.

If the end of the head element is seen, emit a hard parse error, perform implementation-specific heuristics, tear down the DOM, rewind the byte stream and restart the parser. The detection algorithm terminates.

If a meta element is seen whose http-equiv attribute has the value "Content-Type" (compared case-insensitively) and whose content attribute has a value that begins with "text/html; charset=", the string in the content attribute following the prefix "text/html; charset=" is taken, white space is trimmed from both sides and the result is considered the tentative encoding name. (Note: Safari allows spaces, line breaks and tabs around the attribute values. Firefox allows spaces. Opera does not allow anything extra.)

If the tentative encoding name does not identify a rough ASCII superset supported by the UA, emit a hard parse error and perform implementation-specific heuristics. Set the character encoding to the output of the heuristics. If the REWIND flag has been raised, rewind the byte stream and tear down the DOM. If the REWIND flag has not been raised and the heuristics yield a rough ASCII superset, either change the decoder in place or rewind the byte stream, tear down the DOM and restart the parser. (Changing in place is recommended.) The detection algorithm terminates.

If the tentative encoding name identifies a rough ASCII superset supported by the UA, set the character encoding to the tentative encoding. If the REWIND flag has been raised, rewind the byte stream and tear down the DOM. If the REWIND flag has not been raised, either change the decoder in place or rewind the byte stream, tear down the DOM and restart the parser. (Changing in place is recommended.) The detection algorithm terminates.

Where performing implementation-specific heuristics is called for, the UA may analyze the byte spectrum using statistical methods. However, at minimum the UA must fall back on a user-chosen encoding that is a rough ASCII superset. This user choice should default to Windows-1252.

- -
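
To make the steps above concrete, here is a rough Python sketch of the BOM sniffing, the null-byte patterns and the tentative encoding name extraction. The names are mine, and a real UA would work incrementally against the byte stream and its parser's attribute values instead of the buffered inputs assumed here:

def sniff_bom(head):
    # `head` holds the first four bytes of the stream. Returns
    # (encoding, number of BOM bytes to consume) or None. The UTF-32
    # BOMs must be tested before the UTF-16 ones, because the
    # little-endian UTF-32 BOM FF FE 00 00 begins with the
    # little-endian UTF-16 BOM FF FE.
    if head.startswith(b"\x00\x00\xfe\xff"):
        return ("UTF-32BE", 4)
    if head.startswith(b"\xff\xfe\x00\x00"):
        return ("UTF-32LE", 4)
    if head.startswith(b"\xfe\xff"):
        return ("UTF-16BE", 2)
    if head.startswith(b"\xff\xfe"):
        return ("UTF-16LE", 2)
    if head.startswith(b"\xef\xbb\xbf"):
        return ("UTF-8", 3)
    return None

def sniff_null_pattern(head):
    # The BOMless null-byte patterns; each match is a parse error
    # per the prose above. Returns an encoding guess, or None when
    # implementation-specific heuristics must take over.
    b0, b1, b2, b3 = head[0], head[1], head[2], head[3]
    if b0 == 0 and b1 == 0 and b2 == 0 and b3 == 0:
        return None  # hard parse error: all four bytes are null
    if b0 == 0 and b1 == 0 and b2 == 0:  # 00 00 00 NOT-00
        return "UTF-32BE"
    if b0 != 0 and b1 == 0 and b2 == 0 and b3 == 0:
        return "UTF-32LE"
    if b0 == 0 and b1 != 0:
        return "UTF-16BE"
    if b0 != 0 and b1 == 0:
        return "UTF-16LE"
    return None

PREFIX = "text/html; charset="

def tentative_encoding_name(content_value):
    # `content_value` is the value of the meta element's content
    # attribute; returns the tentative encoding name or None.
    v = content_value.strip()
    if v.lower().startswith(PREFIX):
        return v[len(PREFIX):].strip()
    return None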

Requirements I'd like to see:

Documents must specify a character encoding and must use an IANA-registered encoding and must identify it using its preferred MIME name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must recognize the preferred MIME name of every encoding they support that has a preferred MIME name. UAs should recognize IANA-registered aliases.
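
As an example of the name recognition, a UA supporting ISO-8859-1 would map the registered aliases to the preferred MIME name. A sketch with real IANA aliases; the table and function names are mine, and a real table would cover the whole registry:

# Lower-cased IANA aliases of ISO-8859-1 mapped to the preferred
# MIME name; taken from the IANA character set registry.
PREFERRED_MIME_NAME = {
    "iso-8859-1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "iso_8859-1:1987": "ISO-8859-1",
    "iso-ir-100": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "l1": "ISO-8859-1",
    "ibm819": "ISO-8859-1",
    "cp819": "ISO-8859-1",
    "csisolatin1": "ISO-8859-1",
}

def preferred_name(label):
    return PREFERRED_MIME_NAME.get(label.strip().lower())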

Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE (i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the EBCDIC family of encodings. Documents using the UTF-16 or UTF-32 encodings must have a BOM.

UAs must support the UTF-8 encoding.

Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)
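
In Python terms, that recovery is the "replace" error mode, for example:

# A broken UTF-8 sequence: 0xE2 starts a three-byte sequence, 0x28
# is not a continuation byte and 0xA1 is a stray continuation byte,
# so both bogus bytes become U+FFFD and decoding continues.
assert b"\xe2\x28\xa1".decode("utf-8", "replace") == "\ufffd(\ufffd"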

Authors are advised to use the UTF-8 encoding. Authors are advised not to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32 on the Web is harmful and utterly pointless, but Firefox and Opera support it. Also, I'd like to have some text in the spec that justifies whining about legacy encodings. On the XML side, I give warnings if the encoding is not UTF-8, UTF-16, US-ASCII or ISO-8859-1. I also warn about aliases and potential trouble with RFC 3023 rules. However, I have no spec backing for treating dangerous RFC 3023 stuff as errors.)

- -

Also, the spec should probably give guidance on what encodings need to be supported. That set should include at least UTF-8, US-ASCII, ISO-8859-1 and Windows-1252. It should probably not be larger than the intersection of the sets of encodings supported by Firefox, Opera, Safari and IE6. (It might even be useful to intersect that set with the encodings supported by JDK and Python by default.)

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

