Re: [WSG] Re: Encoding test page
Keryx webb wrote:
> xml-prologue

The XML Prologue is the section of the document that contains the XML Declaration, DOCTYPE and any comments or PIs prior to the root element. The XML Declaration on the other hand, or (more specifically) the encoding declaration within the XML declaration, is the special construct that specifies the encoding and is what you're referring to. Please try to get the terminology correct in future.

> If a page is sent as XHTML, one could argue that it's supposed to be self-documenting, and that it might mean that the xml-prologue should be more important than the http-header.

See the Architecture of the WWW rec:
http://www.w3.org/TR/2004/REC-webarch-20041215/#xml-media-types

Information supplied at the protocol level always takes precedence over anything specified in the file itself. Therefore the HTTP Content-Type header takes precedence over the XML declaration.

The precedence rules work like this for the following MIME types:

For application/xml, other application/*+xml and application/xml-external-parsed-entity:
1. charset parameter in the Content-Type header
2. BOM
3. The XML declaration
4. UTF-8

The meta element is never used for determining the encoding. For external parsed entities, the Text Declaration is used instead of the XML declaration. They're similar, but not exactly the same.

For text/xml and text/external-parsed-entity:
1. charset parameter in the Content-Type header
2. US-ASCII

The XML declaration and Text declaration are ignored.

For text/html:

a) According to the spec:
1. charset parameter in the Content-Type header
2. Meta element
3. charset attribute on the link followed to the page.

b) Actual browser implementation is a little unclear at this stage, it's not really well defined. Here's a rough overview anyway:
1. charset parameter in the Content-Type header
2. BOM
3. Meta element
4. Unspecified heuristics (guessing)
5. Default (according to browser pref, which is usually ISO-8859-1 or Windows-1252)

I'm not sure if any UAs actually support the charset attribute for links at all.

--
Lachlan Hunt
http://lachy.id.au/

** The discussion list for http://webstandardsgroup.org/ See http://webstandardsgroup.org/mail/guidelines.cfm for some hints on posting to the list & getting help **
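The precedence lists in the message above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names, the regular expressions and the choice of windows-1252 as the text/html default are hypothetical, the text/html branch follows the "rough overview" list (b) with the guessing step left out, and XML declarations in BOM-less UTF-16 are not handled.

```python
import codecs
import re

# Hypothetical helper patterns; real parsers are stricter than these regexes.
XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)


def bom_encoding(body):
    """Return the encoding implied by a leading UTF-8 or UTF-16 BOM, else None.

    UTF-32 BOMs are deliberately ignored to keep the sketch short.
    """
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return None


def is_application_xml(mime_type):
    """True for application/xml, application/*+xml and the external-entity type."""
    return mime_type in (
        "application/xml",
        "application/xml-external-parsed-entity",
    ) or (mime_type.startswith("application/") and mime_type.endswith("+xml"))


def resolve_encoding(mime_type, header_charset, body):
    """Pick an encoding using the precedence lists quoted in the thread."""
    # Step 1 for every MIME type: the charset parameter of Content-Type wins.
    if header_charset:
        return header_charset.lower()

    # text/xml family: the XML/text declaration is ignored entirely.
    if mime_type in ("text/xml", "text/external-parsed-entity"):
        return "us-ascii"

    bom = bom_encoding(body)

    if is_application_xml(mime_type):
        if bom:
            return bom
        match = XML_DECL.match(body)
        return match.group(1).decode("ascii").lower() if match else "utf-8"

    if mime_type == "text/html":
        # Rough browser behaviour; the "unspecified heuristics" step is omitted
        # and the default is assumed to be windows-1252.
        if bom:
            return bom
        match = META_CHARSET.search(body)
        return match.group(1).decode("ascii").lower() if match else "windows-1252"

    raise ValueError("MIME type not covered by this sketch: " + mime_type)


xhtml = b'<?xml version="1.0" encoding="ISO-8859-1"?><html/>'
print(resolve_encoding("application/xhtml+xml", "UTF-8", xhtml))
print(resolve_encoding("application/xhtml+xml", None, xhtml))
print(resolve_encoding("text/html", None, xhtml))
```

The three calls at the end mirror the cases discussed in the thread: the header overriding the XML declaration, the declaration being used when the header is silent, and the declaration being ignored for text/html.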
Re: [WSG] Re: Encoding test page
Andrew Cunningham wrote:
> Lachlan Hunt writes:
>> Andrew Cunningham wrote:
>>> In theory the document should only be in one of the unicode encodings, so without a BOM, the browser should try to render it as UTF-8.
>> No, because when it's served as text/html, HTML rules apply, not XML rules. So without the encoding declared in the HTTP headers or the meta element, the default of ISO-8859-1 should be used (if served over HTTP, technically US-ASCII otherwise). However, browsers will actually interpret ISO-8859-1 as the Windows-1252 superset and will also attempt to use unspecified heuristics to guess the encoding, before falling back to the default.
> If you're going by the HTTP specs.

Yes, of course, as well as the relevant RFCs for the MIME types.

> If you go by the XHTML 1.0 recommendation, appendix C

Appendix C is non-normative.

> would indicate that "... that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16."

That is only true for XML on the condition that the encoding has not been specified by a higher level protocol. The relevant *normative* section of the XML rec. states in 4.3.3 Character Encoding in Entities:

"In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration."

http://www.w3.org/TR/REC-xml/#charencoding

> The point I wanted to make is that there is another way to declare encoding for documents in UTF-16 or UTF-32: and that's the BOM; and that the test should also include BOM detection as an option,

Not according to the HTML 4 Rec, but...

> i.e. do various web browsers use the BOM as part of their heuristics.

In reality, yes.

--
Lachlan Hunt
http://lachy.id.au/
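The remark that browsers treat ISO-8859-1 as its Windows-1252 superset can be illustrated with Python's built-in codecs. This is a small sketch, not a description of any browser's code: the sample bytes are made up, and the only point is that the two encodings differ in the 0x80-0x9F range and agree from 0xA0 upwards.

```python
# Made-up sample: bytes 0x93/0x94 and 0x80 sit in the range where
# ISO-8859-1 and Windows-1252 disagree.
raw = b"\x93" + b"quoted" + b"\x94" + b" costs " + b"\x80" + b"5"

as_latin1 = raw.decode("iso-8859-1")    # 0x80-0x9F become C1 control characters
as_cp1252 = raw.decode("windows-1252")  # 0x80-0x9F become curly quotes, euro sign

print(repr(as_latin1))
print(repr(as_cp1252))

# From 0xA0 upwards the two encodings agree, e.g. 0xE9 is U+00E9 in both.
same = b"\xe9".decode("iso-8859-1") == b"\xe9".decode("windows-1252")
print(same)
```

A page labelled ISO-8859-1 that really contains Windows-1252 punctuation therefore only displays as its author intended if the browser applies the superset.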
[WSG] Re: Encoding test page
Lachlan Hunt writes:
> Andrew Cunningham wrote:
>> I was wondering if you should have another test in there: XHTML document with no encoding declared in the http header or in a meta tag, and no xml declaration. Sent as html/text.
> That's text/html and an XHTML document served as text/html is HTML, regardless of any lies the DOCTYPE tells you.

Oops ... yep, text/html, more sleep last night would have helped ;)

>> In theory the document should only be in one of the unicode encodings, so without a BOM, the browser should try to render it as UTF-8.
> No, because when it's served as text/html, HTML rules apply, not XML rules. So without the encoding declared in the HTTP headers or the meta element, the default of ISO-8859-1 should be used (if served over HTTP, technically US-ASCII otherwise). However, browsers will actually interpret ISO-8859-1 as the Windows-1252 superset and will also attempt to use unspecified heuristics to guess the encoding, before falling back to the default.

If you're going by the HTTP specs. If you go by the XHTML 1.0 recommendation, appendix C would indicate that "... that when the XML declaration is not included in a document, the document can only use the default character encodings UTF-8 or UTF-16."

But all that is neither here nor there. I'm not fussed about the whole HTML vs XHTML debate.

The point I wanted to make is that there is another way to declare encoding for documents in UTF-16 or UTF-32: and that's the BOM; and that the test should also include BOM detection as an option, i.e. do various web browsers use the BOM as part of their heuristics.

As it is, web browsers do some odd things. You've already mentioned the iso-8859-1 -> Windows-1252 behaviour, likewise GB2312 -> GBK, Big5 and various supersets of it, etc. It is unfortunate behaviour. Things would be more straightforward if browsers didn't do this.

If you need to do an encoding conversion on a document before processing the document, we find that in most cases you can rely on the declared encoding within a document. But there will be cases where this will not work. In some cases we have to track declared and actual encodings of external documents. Unfortunate, but necessary.

Andrew
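Tracking declared versus actual encodings, as described in the message above, might start with a check like the following Python sketch. The function names are hypothetical and the test is deliberately weak: a BOM is compared against the declared name, and otherwise the bytes are simply trial-decoded. Single-byte encodings such as ISO-8859-1 accept any byte sequence, so a passing result never proves the declaration is right.

```python
import codecs

# Longest marks first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(body):
    """Return the encoding family implied by a leading BOM, or None."""
    for mark, name in BOMS:
        if body.startswith(mark):
            return name
    return None


def declared_encoding_is_plausible(body, declared):
    """Return False when the declared encoding clearly contradicts the bytes."""
    try:
        canonical = codecs.lookup(declared).name
    except LookupError:
        return False  # unknown encoding label

    bom = sniff_bom(body)
    if bom is not None:
        # e.g. a UTF-16 BOM is compatible with "utf-16" or "utf-16-le"
        return canonical.startswith(bom)

    try:
        body.decode(canonical)
    except UnicodeDecodeError:
        return False
    return True


latin1_bytes = "caf\u00e9".encode("iso-8859-1")
print(declared_encoding_is_plausible(latin1_bytes, "utf-8"))
print(declared_encoding_is_plausible(latin1_bytes, "iso-8859-1"))
```

Documents that fail this check are the ones where the declared encoding cannot be relied on and some other means of finding the actual encoding is needed.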
Re: [WSG] Re: Encoding test page
Andrew Cunningham wrote:
> I was wondering if you should have another test in there: XHTML document with no encoding declared in the http header or in a meta tag, and no xml declaration. Sent as html/text.

That's text/html and an XHTML document served as text/html is HTML, regardless of any lies the DOCTYPE tells you.

> In theory the document should only be in one of the unicode encodings, so without a BOM, the browser should try to render it as UTF-8.

No, because when it's served as text/html, HTML rules apply, not XML rules. So without the encoding declared in the HTTP headers or the meta element, the default of ISO-8859-1 should be used (if served over HTTP, technically US-ASCII otherwise). However, browsers will actually interpret ISO-8859-1 as the Windows-1252 superset and will also attempt to use unspecified heuristics to guess the encoding, before falling back to the default.

--
Lachlan Hunt
http://lachy.id.au/
[WSG] Re: Encoding test page
Keryx webb writes:
> Andrew Cunningham wrote:
>> Keryx webb writes:
>
> That's what we were discussing. If a page is sent as XHTML, one could argue that it's supposed to be self-documenting, and that it might mean that the xml-prologue should be more important than the http-header. As my page proves, in FFox, MSIE and Opera (the three I've tested) that is not the case.

Of course, since the specs give priority to the http header. It comes down to how the character encoding is declared and how servers are configured. And whether the correct character encoding is declared.

I was wondering if you should have another test in there: XHTML document with no encoding declared in the http header or in a meta tag, and no xml declaration. Sent as html/text.

In theory the document should only be in one of the unicode encodings, so without a BOM, the browser should try to render it as UTF-8.

> If the page is sent as application/xhtml+xml and no encoding has been specified in the http-header, the prologue will be used, though. If the page is sent as text/html Firefox will ignore the prologue even if I've excluded the encoding from the http-header.

Yep, that's as it should be.

Andrew
Re: [WSG] Re: Encoding test page
Hi!

Keryx webb wrote:
> That's what we were discussing. If a page is sent as XHTML, one could argue that it's supposed to be self-documenting, and that it might mean that the xml-prologue should be more important than the http-header. As my page proves, in FFox, MSIE and Opera (the three I've tested) that is not the case.

Look at:
http://www.w3.org/International/tutorials/tutorial-char-enc/en/slides/Slide0400.html

Precedence rules
1. HTTP Content-Type
2. XML declaration
3. meta charset declaration
4. link charset attribute

Related to previous comments -- from an earlier slide of the tutorial:
http://www.w3.org/International/tutorials/tutorial-char-enc/en/slides/Slide0300.html

"For these reasons you should always ensure that encoding information is /also/ declared inside the document."

(and not only in the HTTP headers, that is)

I think the linked tutorial covers most of the questions regarding declaring encodings.

/AndersN
Re: [WSG] Re: Encoding test page
Andrew Cunningham wrote:
> Keryx webb writes:
>> According to my tests Firefox *will* use the charset specified in the http-header over the one in the XML-prologue if a page is sent as application/xhtml+xml. (Or more exactly, regardless whether the page is sent as text/html or application/xhtml+xml.) As will Opera.
> isn't that the way the browsers are supposed to operate? That the http-header has precedence?
> Andrew Cunningham

That's what we were discussing. If a page is sent as XHTML, one could argue that it's supposed to be self-documenting, and that it might mean that the xml-prologue should be more important than the http-header. As my page proves, in FFox, MSIE and Opera (the three I've tested) that is not the case.

If the page is sent as application/xhtml+xml and no encoding has been specified in the http-header, the prologue will be used, though. If the page is sent as text/html Firefox will ignore the prologue even if I've excluded the encoding from the http-header.

Lars Gunther
[WSG] Re: Encoding test page
Keryx webb writes:
> According to my tests Firefox *will* use the charset specified in the http-header over the one in the XML-prologue if a page is sent as application/xhtml+xml. (Or more exactly, regardless whether the page is sent as text/html or application/xhtml+xml.) As will Opera.

Isn't that the way the browsers are supposed to operate? That the http-header has precedence?

Andrew Cunningham
Multicultural Officer
Public Libraries Unit, Vicnet
State Library of Victoria
Australia