On Mar 10, 2006, at 22:49, Ian Hickson wrote:

I'm actually considering just requiring that UAs support rewinding (by
defining the exact semantics of how to parse for the <meta> header). Is
this something people would object to?

I think allowing in-place decoder change (when feasible) would be good for performance.

I think it would be beneficial to additionally stipulate that
1. The meta element-based character encoding information declaration is expected to work only if the Basic Latin range of characters maps to the same
bytes as in the US-ASCII encoding.

Is this realistic? I'm not really familiar enough with character encodings
to say if this is what happens in general.

I suppose it is realistic. See below.

2. If there is no external character encoding information nor a BOM (see
below), there MUST NOT be any non-ASCII bytes in the document byte
stream before the end of the meta element that declares the character
encoding. (In practice this would ban unescaped non-ASCII class names on the html and head elements and non-ASCII comments at the beginning of
the document.)

Again, can we realistically require this? I need to do some studies of
non-latin pages, I guess.

As UA behavior, no. As a conformance requirement, maybe.
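
For illustration, a conformance checker could implement that requirement with a trivial byte scan over the stream prefix. This is just a sketch of the idea; the function name and the notion of a pre-gathered prefix are mine:

# `prefix` is the byte stream up to and including the end of the
# meta element that declares the character encoding. Any byte
# outside the ASCII range in that prefix would be non-conforming.
def non_ascii_offsets(prefix):
    return [i for i, b in enumerate(prefix) if b > 0x7F]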

Authors should avoid including inline character encoding information.
Character encoding information should instead be included at the
transport level (e.g. using the HTTP Content-Type header).

I disagree.

With HTML and contemporary UAs, there is no real harm in including the character encoding information both on the HTTP level and in the meta element as
long as the information is not contradictory. On the contrary, the
author-provided internal information is actually useful when end users
save pages to disk using UAs that do not reserialize with internal
character encoding information.

...and it breaks everything when you have a transcoding proxy, or similar.

Well, not until you save to disk, since HTTP takes precedence. However, authors can escape this by using UTF-8. (Assuming here that tampering with UTF-8 would be harmful, wrong and pointless.)

Interestingly, transcoding proxies tend to be brought up by residents of Western Europe, North America or the Commonwealth. I have never seen a Russian person living in Russia or a Japanese person living in Japan talk about transcoding proxies in any online or offline discussion. That's why I doubt the importance of transcoding proxies.

FWIW, I think Opera Mini is a distributed UA--not a proxy and a UA.

Character encoding information shouldn't be duplicated, IMHO, that's just
asking for trouble.

I suggest a mismatch be considered an easy parse error and, therefore, reportable.

For HTML, user agents must use the following algorithm in determining the
character encoding of a document:
1. If the transport layer specifies an encoding, use that.

Shouldn't there be a BOM-sniffing step here? (UTF-16 and UTF-8 only; UTF-32
makes no practical sense for interchange on the Web.)

I don't know, should there?

I believe there should.

2. Otherwise, if the user agent can find a meta element that specifies
character encoding information (as described above), then use that.

If a conformance checker has not determined the character encoding by
now, what should it do? Should it report the document as non-conforming
(my preferred choice)? Should it default to US-ASCII and report any
non-ASCII bytes as conformance errors? Should it continue to the fuzzier
steps like browsers would (hopefully not)?

Again, I don't know.

I'll continue to treat such documents as non-conforming, then.

Currently the behaviour is very underspecified here:

   http://whatwg.org/specs/web-apps/current-work/#documentEncoding

I'd like to rewrite that bit. It will require a lot of research; of
existing authoring practices, of current UAs, and of author needs. If
anyone wants to step up and do the work, I'd be very happy to work with
them and get something sorted out here.

Disclaimer: This is not based on reading the Gecko or WebKit source. Instead, this is based on quick research into character encodings and on black-box testing of Firefox 1.5, Opera 9.0 preview and Safari 2.0.3. Tests: http://hsivonen.iki.fi/test/wa10/encoding-detection/ (c- means that I think it should be a conforming case and nc- means that I think it should be a non-conforming case.)

It turns out that most character encodings have the property that in the initial state of the decoder the bytes 0x20–0x7E (inclusive) as well as 0x09, 0x0A and 0x0D decode to the Unicode code points of the same (zero-extended) value. Character encodings that have this property (hereafter "rough ASCII superset"; a programmatic check is sketched after the list) include:
Big5
Big5-HKSCS
EUC-JP
EUC-KR
GB18030
GB2312
GBK
IBM00858
IBM437
IBM775
IBM850
IBM852
IBM855
IBM857
IBM860
IBM861
IBM862
IBM863
IBM865
IBM866
IBM868
IBM869
ISO-2022-CN
ISO-2022-JP
ISO-2022-KR
ISO-8859-1
ISO-8859-10
ISO-8859-13
ISO-8859-14
ISO-8859-15
ISO-8859-16
ISO-8859-2
ISO-8859-3
ISO-8859-4
ISO-8859-5
ISO-8859-6
ISO-8859-7
ISO-8859-8
ISO-8859-9
JIS_X0201
KOI8-R
KOI8-U
MacRoman
Shift_JIS
TIS-620
US-ASCII
UTF-8
VISCII
windows-1250
windows-1251
windows-1252
windows-1253
windows-1254
windows-1255
windows-1256
windows-1257
windows-1258
windows-31j
x-ARMSCII
x-Big5-Solaris
x-EUC-TW
x-IBM1006
x-IBM1046
x-IBM1098
x-IBM1124
x-IBM1381
x-IBM1383
x-IBM737
x-IBM856
x-IBM874
x-IBM921
x-IBM922
x-IBM942C
x-IBM943C
x-IBM948
x-IBM949C
x-IBM950
x-IBM970
x-ISO-2022-CN-CNS
x-ISO-2022-CN-GB
x-JISAutoDetect
x-Johab
x-MS950-HKSCS
x-MacArabic
x-MacCentralEurope
x-MacCroatian
x-MacCyrillic
x-MacGreek
x-MacHebrew
x-MacIceland
x-MacRomania
x-MacThai
x-MacTurkish
x-MacUkraine
x-PCK
x-euc-jp-linux
x-eucJP-Open
x-iso-8859-11
x-iso-8859-12
x-mswin-936
x-windows-874
x-windows-949
x-windows-950
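
For concreteness, the property can be probed mechanically. Here is a minimal Python sketch; the helper name is mine, and the labels Python's codec registry accepts do not line up exactly with the IANA names above:

# Probe bytes: 0x09, 0x0A, 0x0D and the range 0x20-0x7E inclusive.
PROBE = bytes([0x09, 0x0A, 0x0D]) + bytes(range(0x20, 0x7F))

def is_rough_ascii_superset(encoding):
    # True if, in the decoder's initial state, every probe byte
    # decodes to the Unicode code point of the same value.
    try:
        return PROBE.decode(encoding) == PROBE.decode("us-ascii")
    except (LookupError, UnicodeDecodeError):
        return False

# For example, is_rough_ascii_superset("shift_jis") is True while
# is_rough_ascii_superset("utf-16") is False.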

Notably, character encodings that I am aware of that lack this property are: JIS_X0212-1990, x-JIS0208, various legacy IBM codepages, x-MacDingbat and x-MacSymbol, UTF-7, UTF-16 and UTF-32.

The x-MacDingbat and x-MacSymbol encodings are irrelevant to Web pages. After browsing the encoding menus of Firefox, Opera and Safari, I'm pretty confident that the legacy IBM codepages are irrelevant as well.

I suggest the following algorithm as a starting point. It does not handle UTF-7, CESU-8, JIS_X0212-1990 or x-JIS0208. (A rough code sketch of the BOM sniffing, the null-byte patterns and the meta prescan follows the algorithm.)

- -

Set the REWIND flag to unraised.

Read the first four bytes of the byte stream.

If the bytes constitute a big-endian UTF-32 BOM, set the character encoding to big-endian UTF-32 and initialize the corresponding decoder. The detection algorithm terminates.

If the bytes constitute a little-endian UTF-32 BOM, set the character encoding to little-endian UTF-32 and initialize the corresponding decoder. The detection algorithm terminates.

If the first two bytes constitute a big-endian UTF-16 BOM, set the character encoding to big-endian UTF-16, unread the third and fourth byte and initialize the corresponding decoder. The detection algorithm terminates.

If the first two bytes constitute a little-endian UTF-16 BOM, set the character encoding to little-endian UTF-16, unread the third and fourth byte and initialize the corresponding decoder. The detection algorithm terminates.

If the first three bytes constitute a UTF-8 BOM, set the character encoding to UTF-8, unread the fourth byte and initialize the corresponding decoder. The detection algorithm terminates.

If the bytes have the pattern 0x00, 0x00, 0x00, 0x00, emit a hard parse error, unread the bytes and perform implementation-specific heuristics. Set the character encoding to the output of the heuristics. The detection algorithm terminates. (Note: need more testing here.)

If the bytes have the pattern 0x00, 0x00, 0x00, NOT-0x00, set the character encoding to UTF-32BE, emit an easy parse error, unread the bytes and initialize the corresponding decoder. The detection algorithm terminates. (Note: need more testing here.)

If the bytes have the pattern NOT-0x00, 0x00, 0x00, 0x00, set the character encoding to UTF-32LE, emit an easy parse error, unread the bytes and initialize the corresponding decoder. The detection algorithm terminates. (Note: need more testing here.)

If the first two bytes have the pattern 0x00, NOT-0x00, set the character encoding to UTF-16BE, emit an easy parse error, unread the bytes and initialize the corresponding decoder. The detection algorithm terminates. (Note: need more testing here.)

If the first two bytes have the pattern NOT-0x00, 0x00, set the character encoding to UTF-16LE, emit an easy parse error, unread the bytes and initialize the corresponding decoder. The detection algorithm terminates. (Note: need more testing here.)

Initialize a character decoder such that the bytes 0x20–0x7E (inclusive) as well as 0x09, 0x0A and 0x0D decode to the Unicode code points of the same (zero-extended) value and all other bytes decode to U+FFFD, raising the REWIND flag and emitting an easy parse error when doing so. If the UA supports in-place decoder switching (see below), the decoder should not buffer and should only consume one byte of the byte stream when one character is read from the decoder.

Start the HTML parser but do not execute scripts.

If a script start tag is seen and the UA supports scripting, raise the REWIND flag and emit an easy parse error.

If a start tag other than html or head is seen, emit an easy parse error.

If the end of the head element is seen, emit a hard parse error, perform implementation-specific heuristics, tear down the DOM, rewind the byte stream and restart the parser. The detection algorithm terminates.

If a meta element is seen whose http-equiv attribute has the value "Content-Type" (compared case-insensitively) and whose content attribute has a value that begins with "text/html; charset=", the string in the content attribute following the prefix "text/html; charset=" is taken, white space is trimmed from both sides and the result is considered the tentative encoding name. (Note: Safari allows spaces, line breaks and tabs around the attribute values. Firefox allows spaces. Opera does not allow anything extra.)

If the tentative encoding name does not identify a rough ASCII superset supported by the UA, emit a hard parse error and perform implementation-specific heuristics. Set the character encoding to the output of the heuristics. If the REWIND flag has been raised, rewind the byte stream and tear down the DOM. If the REWIND flag has not been raised and the heuristics yield a rough ASCII superset, either change the decoder in place or rewind the byte stream, tear down the DOM and restart the parser. (Changing in place is recommended.) The detection algorithm terminates.

If the tentative encoding name identifies a rough ASCII superset supported by the UA, set the character encoding to the tentative encoding. If the REWIND flag has been raised, rewind the byte stream and tear down the DOM. If the REWIND flag has not been raised, either change the decoder in place or rewind the byte stream, tear down the DOM and restart the parser. (Changing in place is recommended.) The detection algorithm terminates.

Where performing implementation-specific heuristics is called for, the UA may analyze the byte spectrum using statistical methods. However, at minimum the UA must fall back on a user-chosen encoding that is a rough ASCII superset. This user choice should default to Windows-1252.

- -
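
To make the steps above concrete, here is a rough Python sketch of the BOM sniffing, the null-byte patterns and the tentative encoding name extraction. The names are mine, and a real UA would work incrementally against the byte stream and its parser's attribute values instead of the buffered inputs assumed here:

def sniff_bom(head):
    # `head` holds the first four bytes of the stream. Returns
    # (encoding, number of BOM bytes to consume) or None. The UTF-32
    # BOMs must be tested before the UTF-16 ones, because the
    # little-endian UTF-32 BOM FF FE 00 00 begins with the
    # little-endian UTF-16 BOM FF FE.
    if head.startswith(b"\x00\x00\xfe\xff"):
        return ("UTF-32BE", 4)
    if head.startswith(b"\xff\xfe\x00\x00"):
        return ("UTF-32LE", 4)
    if head.startswith(b"\xfe\xff"):
        return ("UTF-16BE", 2)
    if head.startswith(b"\xff\xfe"):
        return ("UTF-16LE", 2)
    if head.startswith(b"\xef\xbb\xbf"):
        return ("UTF-8", 3)
    return None

def sniff_null_pattern(head):
    # The BOMless null-byte patterns; each match is a parse error
    # per the prose above. Returns an encoding guess, or None when
    # implementation-specific heuristics must take over.
    b0, b1, b2, b3 = head[0], head[1], head[2], head[3]
    if b0 == 0 and b1 == 0 and b2 == 0 and b3 == 0:
        return None  # hard parse error: all four bytes are null
    if b0 == 0 and b1 == 0 and b2 == 0:  # 00 00 00 NOT-00
        return "UTF-32BE"
    if b0 != 0 and b1 == 0 and b2 == 0 and b3 == 0:
        return "UTF-32LE"
    if b0 == 0 and b1 != 0:
        return "UTF-16BE"
    if b0 != 0 and b1 == 0:
        return "UTF-16LE"
    return None

PREFIX = "text/html; charset="

def tentative_encoding_name(content_value):
    # `content_value` is the value of the meta element's content
    # attribute; returns the tentative encoding name or None.
    v = content_value.strip()
    if v.lower().startswith(PREFIX):
        return v[len(PREFIX):].strip()
    return None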

Requirements I'd like to see:

Documents must specify a character encoding and must use an IANA-registered encoding and must identify it using its preferred MIME name or use a BOM (with UTF-8, UTF-16 or UTF-32). UAs must recognize the preferred MIME name of every encoding they support that has a preferred MIME name. UAs should recognize IANA-registered aliases.
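
As an example of the name recognition, a UA supporting ISO-8859-1 would map the registered aliases to the preferred MIME name. A sketch with real IANA aliases; the table and function names are mine, and a real table would cover the whole registry:

# Lower-cased IANA aliases of ISO-8859-1 mapped to the preferred
# MIME name; taken from the IANA character set registry.
PREFERRED_MIME_NAME = {
    "iso-8859-1": "ISO-8859-1",
    "iso_8859-1": "ISO-8859-1",
    "iso_8859-1:1987": "ISO-8859-1",
    "iso-ir-100": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "l1": "ISO-8859-1",
    "ibm819": "ISO-8859-1",
    "cp819": "ISO-8859-1",
    "csisolatin1": "ISO-8859-1",
}

def preferred_name(label):
    return PREFERRED_MIME_NAME.get(label.strip().lower())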

Documents must not use UTF-EBCDIC, BOCU-1, CESU-8, UTF-7, UTF-16BE (i.e. BOMless), UTF-16LE, UTF-32BE, UTF-32LE or any encodings from the EBCDIC family of encodings. Documents using the UTF-16 or UTF-32 encodings must have a BOM.

UAs must support the UTF-8 encoding.

Encoding errors are easy parse errors. (Emit U+FFFD on bogus data.)
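
In Python terms, that recovery is the "replace" error mode, for example:

# A broken UTF-8 sequence: 0xE2 starts a three-byte sequence, 0x28
# is not a continuation byte and 0xA1 is a stray continuation byte,
# so both bogus bytes become U+FFFD and decoding continues.
assert b"\xe2\x28\xa1".decode("utf-8", "replace") == "\ufffd(\ufffd"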

Authors are advised to use the UTF-8 encoding. Authors are advised not to use the UTF-32 encoding or legacy encodings. (Note: I think UTF-32 on the Web is harmful and utterly pointless, but Firefox and Opera support it. Also, I'd like to have some text in the spec that justifies whining about legacy encodings. On the XML side, I give warnings if the encoding is not UTF-8, UTF-16, US-ASCII or ISO-8859-1. I also warn about aliases and potential trouble with RFC 3023 rules. However, I have no spec backing for treating dangerous RFC 3023 stuff as errors.)

- -

Also, the spec should probably give guidance on what encodings need to be supported. That set should include at least UTF-8, US-ASCII, ISO-8859-1 and Windows-1252. It should probably not be larger than the intersection of the sets of encodings supported by Firefox, Opera, Safari and IE6. (It might even be useful to intersect that set with the encodings supported by JDK and Python by default.)

--
Henri Sivonen
[EMAIL PROTECTED]
http://hsivonen.iki.fi/

