Hi Simon,

If you load a file from disk, then use any meta information the OS can provide. (I think Linux can store Content-Type information for files.) If the OS relies on file extensions (like Windows does) then use that.

Some Linux file systems can store extra metadata in extended
attributes, but in practice I haven't seen any Linux distribution
use them to record the content type of files. That basically leaves
us with file extensions, just like Windows.
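
To make that concrete, here is a rough Python sketch of "check the extended attribute, otherwise fall back to the extension". The attribute name `user.mime_type` and the helper name `stored_or_guessed_type` are my own assumptions for illustration; this is not how Prince or any particular distribution does it.

```python
import mimetypes
import os
from typing import Optional

# Assumption: the content type, if stored at all, lives in this attribute.
XATTR_NAME = "user.mime_type"


def stored_or_guessed_type(path: str) -> Optional[str]:
    """Return a media type from an extended attribute, else from the extension."""
    getxattr = getattr(os, "getxattr", None)  # os.getxattr only exists on Linux
    if getxattr is not None:
        try:
            return getxattr(path, XATTR_NAME).decode("ascii").strip()
        except (OSError, UnicodeDecodeError):
            pass  # attribute missing, xattrs unsupported, or file unreadable
    # Fall back to the file extension, which is what Windows relies on too.
    media_type, _encoding = mimetypes.guess_type(path)
    return media_type
```

Since the attribute is almost never set in practice, the fallback branch is the one that normally runs, so a name ending in .html comes back as text/html.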

.htm and .html are HTML. I know of lots of HTML documents that start with an "XML declaration" but are not well-formed if parsed as XML. (For starters, some version of DreamWeaver emitted XML declarations for documents, but did not ensure well-formedness and the result is often not well-formed.) Even if it was well-formed, it probably wasn't tested under XML conditions so it's likely that style sheets and scripts only work correctly under HTML conditions.

Given that Prince serves a different niche than most user agents, our users tend to be more likely to use XML with embedded SVG etc., and less likely to run Prince on documents created by DreamWeaver. When Prince is run on a document retrieved over HTTP it obeys the Content-Type header, so that documents on the web will be parsed as HTML.

However, it is true that if a document that appears to be XML but actually isn't is downloaded and saved as a file, Prince has no Content-Type header to go on, so it sniffs the content and tries to load the document as XML rather than HTML. The user will then receive error messages if the document is not well-formed. In practice this case does not seem to arise very often, but if it encourages people to either fix their XML and make it well-formed or stop pretending that their HTML is XML, then that doesn't sound like such a bad thing :)
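
As an illustration of the failure case, the following Python sketch checks whether a document opens with an XML declaration and whether it is actually well-formed. The function names are hypothetical and the check uses Python's standard XML parser, so it only approximates what any real user agent would do.

```python
import xml.etree.ElementTree as ET


def starts_with_xml_declaration(data: bytes) -> bool:
    """True if the document opens with an XML declaration."""
    return data.startswith(b"<?xml")


def is_well_formed(data: bytes) -> bool:
    """True if the bytes parse as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


# HTML-style markup behind an XML declaration: <br> is never closed.
pretend_xml = b'<?xml version="1.0"?><html><body><p>one<br>two</p></body></html>'
# The same content written as real XML.
real_xml = b'<?xml version="1.0"?><html><body><p>one<br/>two</p></body></html>'
```

Both documents look like XML to a sniffer, but only the second one survives an XML parser; the first is the kind of file that produces error messages.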

If an author wrote a document and tested it with Prince, finding that XML-only features work even with a .html file extension, then that document would likely break in browsers (because XML-only features don't work in HTML).

This comes back to the thorny issue of how MathML is supposed to work on the web. It seems to require that content be served up as XHTML, which no one does, or that HTML documents contain "XML islands", which is not well specified at all. It would be nice if HTML5 could tackle this in a way that makes sense.

HTML5 has specified content-sniffing rules, FWIW: http://www.whatwg.org/specs/web-apps/current-work/#content-type-sniffing

Yes, these rules never seem to identify a document as being XML, though.

See also: http://www.w3.org/Bugs/Public/show_bug.cgi?id=1500

Prince always respects the Content-Type header, and only sniffs document content when no such metadata is available.
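
The order of precedence described here can be sketched in Python roughly as follows. This is only an illustration under my own assumptions, not Prince's actual code: the function name `choose_parser`, the list of XML media types, and the rule "an XML declaration means XML" are all stand-ins.

```python
from typing import Optional


def choose_parser(content_type: Optional[str], data: bytes) -> str:
    """Pick "html" or "xml": metadata first, content sniffing only as a fallback."""
    if content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "text/html":
            return "html"
        if media_type in ("text/xml", "application/xml") or media_type.endswith("+xml"):
            return "xml"
    # No usable metadata (or, as an assumption, an unrecognised type): sniff.
    return "xml" if data.startswith(b"<?xml") else "html"
```

With a header of text/html the document is treated as HTML even if it starts with an XML declaration; the same bytes loaded with no header at all are treated as XML.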

Best regards,

Michael

--
Print XML with Prince!
http://www.princexml.com
