On Sun, 04 Mar 2007 07:33:51 +0100, Michael Day <[EMAIL PROTECTED]> wrote:

> For user agents like Prince that support XML and HTML content it is sometimes necessary to distinguish whether a .html file is actually XML or HTML in order for it to be processed correctly.
>
> I've written an article for XML.com explaining exactly how Prince performs content sniffing to distinguish XML and HTML documents:
>
>      What Does XML Smell Like?
>      http://www.xml.com/pub/a/2007/02/28/what-does-xml-smell-like.html
>
> Any feedback would be greatly appreciated. No doubt at some point it will be necessary to revise our heuristics for HTML5 :)

If you load a file from disk, then use any meta information the OS can provide. (I think Linux can store Content-Type information for files.) If the OS relies on file extensions (like Windows does) then use that.
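
A rough sketch of that lookup order in Python, not a description of what Prince or any browser actually does. The "user.mime_type" extended attribute is an assumption (it is the freedesktop.org convention for storing a MIME type on Linux), the function names are hypothetical, and the extension table just follows the extensions named in this message:

```python
import os

# Extension mapping as argued in this message:
# .htm/.html mean HTML; .xhtml/.xht/.xml mean XML.
EXTENSION_TYPES = {
    ".htm": "html",
    ".html": "html",
    ".xhtml": "xml",
    ".xht": "xml",
    ".xml": "xml",
}

XML_MIME_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def type_from_mime(mime):
    """Map a Content-Type value to "html", "xml", or None if unrecognized."""
    mime = mime.split(";", 1)[0].strip().lower()  # drop parameters like charset
    if mime == "text/html":
        return "html"
    if mime in XML_MIME_TYPES or mime.endswith("+xml"):
        return "xml"
    return None


def stored_mime_type(path):
    """Return the MIME type the OS has stored for the file, if any.

    Assumption: the type lives in the "user.mime_type" extended attribute.
    os.getxattr only exists on Linux, so other platforms return None.
    """
    getxattr = getattr(os, "getxattr", None)
    if getxattr is None:
        return None
    try:
        return getxattr(path, "user.mime_type").decode("ascii", "replace")
    except OSError:
        # Attribute not set, file missing, or filesystem without xattr support.
        return None


def choose_parser(path):
    """Prefer OS metadata; fall back to the file extension."""
    mime = stored_mime_type(path)
    if mime:
        kind = type_from_mime(mime)
        if kind:
            return kind
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_TYPES.get(ext)
```

Note that nothing here looks inside the file: the decision rests entirely on metadata and the extension.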

.htm and .html are HTML. I know of lots of HTML documents that start with an "XML declaration" but are not well-formed if parsed as XML. (For starters, some version of Dreamweaver emitted XML declarations for documents without ensuring well-formedness, and the result is often not well-formed.) Even if such a document were well-formed, it probably wasn't tested under XML conditions, so it's likely that its style sheets and scripts only work correctly under HTML conditions.
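
To illustrate the first point, here is a small Python check using the standard library's xml.etree.ElementTree. The two sample documents are made up for this example; the first is the kind of tag soup that carries an XML declaration but cannot be parsed as XML:

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples: both start with an XML declaration.
# Unclosed <br> and <p> elements are fine for an HTML parser, fatal for XML.
tag_soup = '<?xml version="1.0"?>\n<html><body><p>One<br><p>Two</body></html>'
clean = '<?xml version="1.0"?>\n<html><body><p>One<br/></p><p>Two</p></body></html>'

print(is_well_formed_xml(tag_soup))  # False
print(is_well_formed_xml(clean))     # True
```

So sniffing for the declaration alone says nothing about whether an XML parser will accept the document.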

From the article:

| It is common for XHTML files to be given an extension of .html or .htm,
| as .xhtml is rather long and .xht is rather obscure. This means that a
| file with an extension of .html may actually be an XML document and
| require an XML parser.

This is completely bogus. Those "XHTML" files are most likely intended to be treated as HTML, not as XML. An author who wanted a file to be treated as XML would use .xhtml, .xht or .xml. Even if the file worked correctly with an XML parser, it would likely also work correctly with an HTML parser (since all browsers would treat it as HTML, and authors mostly test their documents in some browser).

If an author wrote a document and tested it with Prince, finding that XML-only features work even with a .html file extension, then that document would likely break in browsers (because XML-only features don't work in HTML).

HTML5 has specified content-sniffing rules, FWIW: http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing

See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

--
Simon Pieters
