Re: [whatwg] Distinguishing XML and HTML by content sniffing

Simon Pieters Sun, 04 Mar 2007 04:20:02 -0800

On Sun, 04 Mar 2007 07:33:51 +0100, Michael Day <[EMAIL PROTECTED]> wrote:

> For user agents like Prince that support XML and HTML content it is sometimes necessary to distinguish whether a .html file is actually XML or HTML in order for it to be processed correctly.
> I've written an article for XML.com explaining exactly how Prince performs content sniffing to distinguish XML and HTML documents:
>      What Does XML Smell Like?
>      http://www.xml.com/pub/a/2007/02/28/what-does-xml-smell-like.html
> Any feedback would be greatly appreciated. No doubt at some point it will be necessary to revise our heuristics for HTML5 :)

If you load a file from disk, then use any meta information the OS can provide. (I think Linux can store Content-Type information for files.) If the OS relies on file extensions (like Windows does) then use that.

.htm and .html are HTML. I know of lots of HTML documents that start with an "XML declaration" but are not well-formed if parsed as XML. (For starters, some version of DreamWeaver emitted XML declarations for documents but did not ensure well-formedness, and the result is often not well-formed.) Even if such a document were well-formed, it probably wasn't tested under XML conditions, so it's likely that its style sheets and scripts only work correctly under HTML conditions.
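To illustrate the point about XML declarations, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The helper name is_well_formed and both sample documents are made up for illustration; they are not taken from any real Dreamweaver output. It only shows that a leading XML declaration says nothing about whether the rest of the file parses as XML.

```python
import xml.etree.ElementTree as ET


def is_well_formed(data: bytes) -> bool:
    """Return True if the bytes parse as a well-formed XML document."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


# Hypothetical tag soup: it starts with an XML declaration, but <br> and
# the <p> elements are never closed, which is fine in HTML and fatal in XML.
tag_soup = b'<?xml version="1.0"?>\n<html><body><p>One<br><p>Two</body></html>'

# The same content written so that an XML parser accepts it.
clean = b'<?xml version="1.0"?>\n<html><body><p>One<br/></p><p>Two</p></body></html>'

print(is_well_formed(tag_soup))
print(is_well_formed(clean))
```

A sniffer that switches to an XML parser on seeing the declaration alone would reject the first document outright, even though browsers render it without complaint.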


From the article:

| It is common for XHTML files to be given an extension of .html or .htm,
| as .xhtml is rather long and .xht is rather obscure. This means that a
| file with an extension of .html may actually be an XML document and
| require an XML parser.

This is completely bogus. Those "XHTML" files are most likely intended to be treated as HTML and not as XML. If an author wanted a file to be treated as XML, he/she would use .xhtml, .xht or .xml. Even if it would work correctly with an XML parser, it would likely also work correctly with an HTML parser (since all browsers would treat it as HTML, and authors mostly test their documents in some browser).
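The rule suggested in this message (trust any Content-Type metadata first, then fall back to the file extension) could be sketched roughly as follows in Python. The function name choose_parser, the content_type parameter, and the list of XML MIME types are assumptions made for this example; how the metadata is actually obtained from the OS is left out. This is not how Prince or any browser implements it.

```python
import os

# Extension tables as suggested in this message.
HTML_EXTENSIONS = {".htm", ".html"}
XML_EXTENSIONS = {".xhtml", ".xht", ".xml"}

# Assumed set of XML media types for this sketch; any "+xml" suffix also counts.
XML_TYPES = {"application/xhtml+xml", "application/xml", "text/xml"}


def choose_parser(path, content_type=None):
    """Return "html", "xml" or "unknown" without looking at the file content."""
    # 1. Metadata wins if the caller could obtain any (hypothetical input).
    if content_type is not None:
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime == "text/html":
            return "html"
        if mime in XML_TYPES or mime.endswith("+xml"):
            return "xml"
    # 2. Otherwise fall back to the file extension, case-insensitively.
    ext = os.path.splitext(path)[1].lower()
    if ext in HTML_EXTENSIONS:
        return "html"
    if ext in XML_EXTENSIONS:
        return "xml"
    return "unknown"


print(choose_parser("index.html"))
print(choose_parser("report.xhtml"))
print(choose_parser("page.html", "application/xhtml+xml; charset=utf-8"))
```

Note that the file content is never inspected here: a .html file is handed to the HTML parser even if it begins with an XML declaration, which matches what browsers would do with it.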

If an author wrote a document and tested it with Prince, finding that XML-only features work even with a .html file extension, then it is likely that the document would break in browsers (because XML-only features don't work in HTML).

HTML5 has specified content-sniffing rules, FWIW: http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing


See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

--
Simon Pieters
