Re: [WSG] Re: Encoding test page

2006-03-16 Thread Lachlan Hunt

Keryx webb wrote:

xml-prologue


The XML Prologue is the section of the document that contains the XML
Declaration, DOCTYPE and any comments or PIs prior to the root element.
 The XML Declaration on the other hand, or (more specifically) the
encoding declaration within the XML declaration, is the special
construct that specifies the encoding and is what you're referring to.
Please try to get the terminology correct in future.

If a page is sent as XHTML, one could 
argue that it's supposed to be self-documenting, and that it might mean 
that the xml-prologue should be more important than the http-header.


See the Architecture of the WWW rec:
http://www.w3.org/TR/2004/REC-webarch-20041215/#xml-media-types

Information supplied at the protocol level always takes precedence over
anything specified in the file itself.  Therefore the HTTP Content-Type
header takes precedence over the XML declaration.

The precedence rules work like this for the following MIME types:

For application/xml, other application/*+xml
and application/xml-external-parsed-entity:
1. charset parameter in the Content-Type header
2. BOM
3. The XML declaration
4. UTF-8

The meta element is never used for determining the encoding.

For external parsed entities, it uses the Text Declaration instead of 
the XML declaration.  They're similar, but not exactly the same.
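
As a rough illustration of the application/xml order above, here is a minimal Python sketch. The function names (sniff_bom, application_xml_encoding) and the simplified regular expression for the encoding declaration are assumptions made for this example only. It only reads declarations written in an ASCII-compatible encoding, which a real XML processor cannot assume.

```python
import codecs
import re

# BOMs checked longest-first so UTF-32LE is not mistaken for UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]

# Simplified match for the encoding declaration inside the XML declaration.
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?\sencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_bom(body):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    return None


def application_xml_encoding(header_charset, body):
    """Hypothetical helper: pick the encoding for application/xml content."""
    if header_charset:                 # 1. charset parameter in Content-Type
        return header_charset
    from_bom = sniff_bom(body)         # 2. BOM
    if from_bom:
        return from_bom
    match = _XML_DECL.match(body)      # 3. the XML declaration
    if match:
        return match.group(1).decode("ascii")
    return "UTF-8"                     # 4. default
```

Note that when the header carries a charset, the function returns it even if the XML declaration names a different encoding.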


For text/xml and text/external-parsed-entity:
1. charset parameter in the Content-Type header
2. US-ASCII

The XML declaration and Text declaration are ignored.
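
A small Python sketch of what that means in practice, assuming a hypothetical helper called text_xml_encoding. The sample document is made up for the example. Its XML declaration says UTF-8, but with no charset parameter the text/xml rule selects US-ASCII, so the non-ASCII character cannot be decoded.

```python
def text_xml_encoding(header_charset):
    """Hypothetical helper: pick the encoding for text/xml content."""
    # 1. charset parameter in Content-Type, 2. US-ASCII.
    # The XML declaration in the body is deliberately never consulted.
    return header_charset or "US-ASCII"


# Made-up sample: declares UTF-8 internally and contains a non-ASCII character.
body = '<?xml version="1.0" encoding="UTF-8"?><p>\u00e9</p>'.encode("utf-8")

encoding = text_xml_encoding(None)   # no charset parameter was sent
try:
    body.decode(encoding)
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False               # the declaration did not help
```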

For text/html:

a) According to the spec:
1. charset parameter in the Content-Type header
2. Meta element
3. charset attribute on the link followed to the page.

b) Actual browser implementation is a little unclear at this stage; it's 
not really well defined.  Here's a rough overview anyway:

1. charset parameter in the Content-Type header
2. BOM
3. Meta element
4. Unspecified heuristics (guessing)
5. Default (according to browser pref, which is usually ISO-8859-1 or
   Windows-1252)

I'm not sure if any UAs actually support the charset attribute for links 
at all.
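
The rough browser order in (b) could be sketched in Python as follows. This is only an illustration under stated assumptions: the function name text_html_encoding, the simplified meta regular expression, the 1024-byte scan limit and the windows-1252 default are all choices made for the example, not what any particular browser does. The heuristic step is left as an optional callback because it is unspecified.

```python
import codecs
import re

# Simplified search for charset=... inside a meta tag (an assumption for this sketch).
_META_CHARSET = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._-]+)""", re.IGNORECASE
)

_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def text_html_encoding(header_charset, body, guess=None, default="windows-1252"):
    """Hypothetical helper mirroring the rough text/html order."""
    if header_charset:                          # 1. Content-Type charset
        return header_charset
    for bom, name in _BOMS:                     # 2. BOM
        if body.startswith(bom):
            return name
    match = _META_CHARSET.search(body[:1024])   # 3. meta element
    if match:
        return match.group(1).decode("ascii")
    if guess is not None:                       # 4. unspecified heuristics
        guessed = guess(body)
        if guessed:
            return guessed
    return default                              # 5. browser default
```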


--
Lachlan Hunt
http://lachy.id.au/


**
The discussion list for  http://webstandardsgroup.org/

See http://webstandardsgroup.org/mail/guidelines.cfm
for some hints on posting to the list & getting help
**



Re: [WSG] Re: Encoding test page

2006-03-14 Thread Lachlan Hunt

Andrew Cunningham wrote:

Lachlan Hunt writes:

Andrew Cunningham wrote:
In theory the document should only be in one of the Unicode 
encodings, so without a BOM, the browser should try to render it as 
UTF-8.


No, because when it's served as text/html, HTML rules apply, not XML 
rules.  So without the encoding declared in the HTTP headers or the 
meta element, the default of ISO-8859-1 should be used (if served over 
HTTP, technically US-ASCII otherwise).  However, browsers will 
actually interpret ISO-8859-1 as the Windows-1252 superset and will 
also attempt to use unspecified heuristics to guess the encoding, 
before falling back to the default.


If you're going by the HTTP specs.


Yes, of course, as well as the relevant RFCs for the MIME types.


If you go by the XHTML 1.0 Recommendation, Appendix C


Appendix C is non-normative.

would indicate that "... when the XML declaration is not included in a document, the 
document can only use the default character encodings UTF-8 or UTF-16."


That is only true for XML on the condition that the encoding has not 
been specified by a higher level protocol.  The relevant *normative* 
section of the XML rec. states in 4.3.3 Character Encoding in Entities:


In the absence of information provided by an external transport protocol 
(e.g. HTTP or MIME), it is a fatal error for an entity including an 
encoding declaration to be presented to the XML processor in an encoding 
other than that named in the declaration, or for an entity which begins 
with neither a Byte Order Mark nor an encoding declaration to use an 
encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, 
ordinary ASCII entities do not strictly need an encoding declaration.


http://www.w3.org/TR/REC-xml/#charencoding

The point I wanted to make is that there is another way to declare the 
encoding for documents in UTF-16 or UTF-32, and that's the BOM; and that 
the test should also include BOM detection as an option,


Not according to the HTML 4 Rec, but...


i.e. do various web browsers use the BOM as part of their heuristics?


In reality, yes.

--
Lachlan Hunt
http://lachy.id.au/



Re: [WSG] Re: Encoding test page

2006-03-14 Thread Lachlan Hunt

Andrew Cunningham wrote:
I was wondering if you should have another test in there: XHTML document 
with no encoding declared in the http header or in a meta tag, and no 
xml declaration. Sent as html/text.


That's text/html, and an XHTML document served as text/html is HTML, 
regardless of any lies the DOCTYPE tells you.


In theory the document should only be in one of the Unicode encodings, 
so without a BOM, the browser should try to render it as UTF-8.


No, because when it's served as text/html, HTML rules apply, not XML 
rules.  So without the encoding declared in the HTTP headers or the meta 
element, the default of ISO-8859-1 should be used (if served over HTTP, 
technically US-ASCII otherwise).  However, browsers will actually 
interpret ISO-8859-1 as the Windows-1252 superset and will also attempt 
to use unspecified heuristics to guess the encoding, before falling back 
to the default.
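
To see why the ISO-8859-1 versus Windows-1252 distinction matters, here is a short Python sketch. The sample bytes are made up for the example. The two encodings agree on ASCII, but bytes 0x80 to 0x9F are printable characters in Windows-1252 and C1 control characters in ISO-8859-1.

```python
# 0x93 and 0x94 are curly quotes in Windows-1252 but C1 controls in ISO-8859-1.
raw = b"\x93quoted\x94"

as_latin1 = raw.decode("iso-8859-1")     # strict ISO-8859-1 reading
as_cp1252 = raw.decode("windows-1252")   # what browsers typically do instead

print(repr(as_latin1))
print(repr(as_cp1252))

# Plain ASCII text decodes identically under both labels.
same_for_ascii = (
    b"plain text".decode("iso-8859-1") == b"plain text".decode("windows-1252")
)
```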


--
Lachlan Hunt
http://lachy.id.au/



Re: [WSG] Re: Encoding test page

2006-03-14 Thread Anders Nawroth

Hi!

Keryx webb wrote:
That's what we were discussing. If a page is sent as XHTML, one could 
argue that it's supposed to be self-documenting, and that it might 
mean that the xml-prologue should be more important than the 
http-header. As my page proves, in FFox, MSIE and Opera (the three 
I've tested) that is not the case.

Look at:
http://www.w3.org/International/tutorials/tutorial-char-enc/en/slides/Slide0400.html



  Precedence rules

   1. HTTP Content-Type
   2. XML declaration
   3. meta charset declaration
   4. link charset attribute



Related to previous comments -- from an earlier slide of the tutorial:
http://www.w3.org/International/tutorials/tutorial-char-enc/en/slides/Slide0300.html
For these reasons you should always ensure that encoding information 
is /also/ declared inside the document.

(and not only in the HTTP headers, that is)

I think the linked tutorial covers most of the questions regarding 
declaring encodings.


/AndersN




Re: [WSG] Re: Encoding test page

2006-03-14 Thread Keryx webb

Andrew Cunningham wrote:

Keryx webb writes:


According to my tests Firefox *will* use the charset specified in the 
http-header over the one in the XML-prologue if a page is sent as 
application/xhtml+xml. (Or more exactly, regardless whether the page 
is sent as text/html or application/xhtml+xml.) As will Opera.


Isn't that the way the browsers are supposed to operate? That the 
http-header has precedence?


Andrew Cunningham


That's what we were discussing. If a page is sent as XHTML, one could argue that 
it's supposed to be self-documenting, and that it might mean that the 
xml-prologue should be more important than the http-header. As my page proves, 
in FFox, MSIE and Opera (the three I've tested) that is not the case.


If the page is sent as application/xhtml+xml and no encoding has been specified 
in the http-header, the prologue will be used, though. If the page is sent as 
text/html Firefox will ignore the prologue even if I've excluded the encoding 
from the http-header.


Lars Gunther