Re: [whatwg] Distinguishing XML and HTML by content sniffing

Michael Day Mon, 05 Mar 2007 15:25:34 -0800

Hi Simon,

If you load a file from disk, then use any meta information the OS can provide. (I think Linux can store Content-Type information for files.) If the OS relies on file extensions (like Windows does) then use that.


Some Linux file systems might potentially be capable of storing extra
metadata in extended attributes, but in practice I haven't seen any
Linux distributions actually use this functionality for storing the
content type of files. This basically leaves us with file extensions,
just like Windows.

.htm and .html are HTML. I know of lots of HTML documents that start with an "XML declaration" but are not well-formed if parsed as XML. (For starters, some version of DreamWeaver emitted XML declarations for documents, but did not ensure well-formedness and the result is often not well-formed.) Even if it was well-formed, it probably wasn't tested under XML conditions so it's likely that style sheets and scripts only work correctly under HTML conditions.

Given that Prince serves a different niche than most user agents, our users tend to be more likely to use XML with embedded SVG etc., and less likely to run Prince on documents created by DreamWeaver. When Prince is run on a document retrieved over HTTP it obeys the Content-Type header, so that documents on the web will be parsed as HTML.

However, it is true that if a document that appears to be XML but actually isn't is downloaded and saved as a file, then Prince will sniff the content in the absence of a Content-Type header and try to load it as XML rather than HTML. The user will then receive error messages if the document is not well-formed. In practice, this case does not seem to arise very often, but if it encourages people to either fix their XML and make it well-formed or stop pretending that their HTML is XML then that doesn't sound like such a bad thing :)
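To make the failure case concrete, here is a minimal sketch of a well-formedness check, using Python's standard XML parser rather than anything Prince actually does. The function name and the two sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(data: bytes) -> bool:
    """Return True if `data` parses as well-formed XML (hypothetical helper)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# Starts with an XML declaration, but <br> and the first <p> are never closed.
tag_soup = b'<?xml version="1.0"?>\n<html><body><p>one<br><p>two</body></html>'

# Same idea with every element properly closed.
clean = b'<?xml version="1.0"?>\n<html><body><p>one<br/></p><p>two</p></body></html>'

print(is_well_formed_xml(tag_soup))
print(is_well_formed_xml(clean))
```

A document like the first one looks like XML to a sniffer that only checks for the declaration, which is exactly when the user would see well-formedness errors.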

If an author wrote a document and tested it with Prince, finding that XML-only features work even with a .html file extension, then it is likely that that document would break in browsers (because XML-only features don't work in HTML).

This comes back to the thorny issue of how MathML is supposed to work on the web. It seems to require that content be served up as XHTML, which no one does, or that HTML documents contain "XML islands", which is not well specified at all. It would be nice if HTML5 could tackle this in a way that makes sense.

HTML5 has specified content-sniffing rules, FWIW: http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing


Yes, these rules never seem to identify a document as being XML, though.

See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

Prince always respects the Content-Type header, and only sniffs document content when no such metadata is available.
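As a rough sketch of that ordering (header first, sniffing only as a fallback), something like the following would do. This is not Prince's implementation: the function name, the list of XML media types, the 512-byte window and the "starts with an XML declaration" heuristic are all assumptions made for illustration.

```python
from typing import Optional

# Assumed set of media types treated as XML; any "+xml" suffix is also accepted.
XML_TYPES = {"text/xml", "application/xml", "application/xhtml+xml"}


def choose_parser(content_type: Optional[str], data: bytes) -> str:
    """Return "xml" or "html" for a document (hypothetical helper)."""
    if content_type:
        # Metadata is available: obey it and never look at the content.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type in XML_TYPES or media_type.endswith("+xml"):
            return "xml"
        return "html"  # simplification: everything else is treated as HTML

    # No metadata: sniff the start of the document.
    # Skip a UTF-8 byte order mark and leading whitespace (assumed heuristic).
    head = data[:512].lstrip(b"\xef\xbb\xbf \t\r\n")
    if head.startswith(b"<?xml"):
        return "xml"
    return "html"


doc = b'<?xml version="1.0"?><html><body><br></body></html>'
print(choose_parser("text/html; charset=utf-8", doc))  # header wins
print(choose_parser(None, doc))                        # sniffed from content
```

With a header present the XML declaration is ignored entirely; the same bytes loaded from a file with no metadata fall through to the sniffing branch.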


Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com
