Currently, the spec says that document.open() sets the document's character 
encoding to UTF-16. This matches what IE does, except that IE uses the label 
"unicode" instead of "UTF-16".
Demo: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/438

Gecko and WebKit set the document's character encoding to UTF-8 even though 
the parser operates on UTF-16.
Demo: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/439

When loading external resources that don't have encoding labels, IE, Gecko and 
WebKit all use UTF-8 to decode the external resource.
Demo: http://software.hixie.ch/utilities/js/live-dom-viewer/saved/437

Opera doesn't support document.charset or document.characterSet, but demo 437 
and the demos discussed below show that Opera applies the default encoding 
(Windows-1252) to external resources referenced from document.open()ed 
documents.
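
For anyone who wants to reproduce the observations above without the live DOM 
viewer, here is a minimal sketch of the kind of check involved. The helper name 
reportedEncoding is hypothetical, and the iframe setup is my assumption about 
how to get a fresh document to call document.open() on; it is not taken from 
the demos.

```javascript
// Hypothetical helper: returns the encoding label a document exposes.
// document.characterSet and document.charset are the two properties named
// above; a browser that supports neither (as described for Opera) yields null.
function reportedEncoding(doc) {
  return doc.characterSet || doc.charset || null;
}

// Browser-only part, guarded so the file also loads outside a browser.
// Assumes document.body already exists when this runs.
if (typeof document !== "undefined" && document.body) {
  const frame = document.createElement("iframe");
  document.body.appendChild(frame);
  const doc = frame.contentDocument;
  doc.open();
  doc.write("<!DOCTYPE html><title>document.open() test</title>");
  doc.close();
  // Log whatever label this browser reports for the document.open()ed document.
  console.log(reportedEncoding(doc));
}
```

The helper only reads the label; it says nothing about which encoding the 
browser actually applies to unlabeled external resources, which is what demo 
437 tests.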

Spec change request: Please change the spec to say that document.open() sets 
the document's character encoding to UTF-8 even though the parser operates on 
UTF-16 DOMStrings.

My real interest in this topic isn't so much about the initial character 
encoding value but about the effect of <meta charset> on document.open()ed 
documents.

Consider this demo in Gecko with the old HTML parser:
http://software.hixie.ch/utilities/js/live-dom-viewer/saved/434

The demo alerts two times: first showing the REPLACEMENT CHARACTER and then 
showing LATIN SMALL LETTER R WITH ACUTE. First, Gecko parses the document with 
UTF-8 as the document's character encoding. During that parse, the value 
ISO-8859-2 from the meta is added to the cache entry for this stream (see my 
earlier email about reloading document.open()ed documents). Then the document 
is implicitly reloaded with ISO-8859-2 as the document's character encoding. 
This was implemented in https://bugzilla.mozilla.org/show_bug.cgi?id=255820 
back when Gecko used UTF-16 instead of UTF-8 as the document's character 
encoding for document.open()ed docs, and using UTF-16 for external resources 
made those resources fail to parse.
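
To illustrate why the two alerts differ, here is a small sketch of the decoding 
step alone. It assumes the character in question is stored as the single byte 
0xE0, which is LATIN SMALL LETTER R WITH ACUTE in ISO-8859-2; I haven't 
confirmed that this is the exact byte used in demo 434. It also assumes a 
runtime whose TextDecoder supports the "iso-8859-2" label (browsers do; Node.js 
needs a build with full ICU).

```javascript
// Assumed input: the one byte 0xE0, as it might appear in an unlabeled
// external resource saved as ISO-8859-2.
const bytes = new Uint8Array([0xe0]);

// First pass: decoded with UTF-8 as the document's character encoding.
// 0xE0 is a lead byte with no continuation bytes, so the decoder emits
// U+FFFD REPLACEMENT CHARACTER.
const asUtf8 = new TextDecoder("utf-8").decode(bytes);

// Second pass: decoded as ISO-8859-2, as after the implicit reload.
// 0xE0 maps to U+0155 LATIN SMALL LETTER R WITH ACUTE.
const asLatin2 = new TextDecoder("iso-8859-2").decode(bytes);

// Print the code points rather than the characters to avoid console
// encoding issues.
console.log(asUtf8.codePointAt(0).toString(16));
console.log(asLatin2.codePointAt(0).toString(16));
```

This only models the byte-to-character step; the caching and reload logic in 
Gecko is not represented.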

Curiously, the implicit reloading behavior isn't particularly robust. In some 
situations the reload doesn't happen. I don't know what the logic is.
Demo with the order of meta and script swapped: 
http://software.hixie.ch/utilities/js/live-dom-viewer/saved/435

None of IE, WebKit or Opera let the meta charset in a document.open()ed 
document have any effect, which suggests that Gecko might be trying 
unnecessarily hard in this scenario.

Due to the test case for https://bugzilla.mozilla.org/show_bug.cgi?id=255820 I 
made the meta charset change the document's character encoding (but not reload) 
when the HTML5 parser is enabled in Gecko. See demos 435 and 434 with 
html5.enable=true. However, now it seems it might be better to revert that 
change to align with IE and WebKit--unless sites now depend on the Gecko 
behavior. Do other browser vendors have data showing sites depending on Gecko's 
behavior when loading external resources for document.open()ed docs?

-- 
Henri Sivonen
[email protected]
http://hsivonen.iki.fi/
