Hi,
XML is used for many different kinds of files, and Apache serving up all XML 
files as application/xml unless explicitly told otherwise is suboptimal. 
Correct use of more specific types helps with content negotiation, for 
instance: a user agent might have a preference between 'text/vcard' and 
'application/vcard+xml'. As with XHTML, which Apache already handles, these 
files usually contain enough information to identify their correct type and, 
when the top-level element has a designated XML namespace, this can be done 
without any chance of error: determining a more specific media type for an XML 
document is then a deterministic procedure, not a matter of guesswork.

I can't find any existing solution for this with Apache as the HTTP server, 
though, such as a module. Is this just something no one has gotten around to 
implementing yet (either in the Apache HTTP Server project or independently)? 
Has anyone solved this problem before? Here's some research I've done on the 
matter.

 • It appears there is precedent for using libxml2 to implement functionality 
in httpd, but the only obvious example is mod_xml2enc 
https://httpd.apache.org/docs/trunk/mod/mod_xml2enc.html which handles text 
encodings on the fly as a filter. If libxml2 is already used by some Apache 
modules, then using it to read an XML document's root element, namespace, and 
DOCTYPE declaration ought to be pretty straightforward, as a 
first step to inform the choice of a superior media type. Do any other parts of 
the server do anything like this? If not, I'm hopeful a quality implementation 
of this could be considered for inclusion in the core distribution.

 • To heuristically determine media types of files generally, mod_mime_magic 
https://httpd.apache.org/docs/trunk/mod/mod_mime_magic.html is described as 
working "like" the file(1) command. Unfortunately, Apache uses its own 
home-grown implementation for this job ("This module is derived from a free 
version of the file(1) command for Unix"), and it expects to be used with a 
MimeMagicFile in the format of the one supplied by Apache. This means that even 
when improvements are made to the file(1) command and the libmagic library that 
the majority of libre systems use, they will not trickle down to Apache. Is 
there a reason for this apparent code duplication? Maybe it comes from a time 
before the libmagic library https://www.darwinsys.com/file/ existed; curiously, 
that upstream project is the same one mod_mime_magic is based on anyway.

 • All of this rests on the premise that the name of an XML document's root 
element, along with at least one of a document type declaration or an XML 
namespace declaration, can uniquely identify the document's kind and tell user 
agents how to use the XML files they fetch. I see two different approaches to 
realizing this, neither of which I've been able to pull off.

        ◦ The IANA has a registration for a media type called 
application/prs.implied-document+xml 
https://www.iana.org/assignments/media-types/application/prs.implied-document+xml
which allows this concept in general. The comments for the registration 
express this well:
> This media type identifies a meta-format that encompasses all XML-based 
> formats which are identified by a particular name of the root element, 
> optionally together with a namespace URI or the PUBLIC identifier stored in 
> the DTD. It it intended for use in applications that describe files using 
> media types, but do not have sufficient heuristics to output a more specific 
> media type. In such a case, the application may parse XML and use the name of 
> the root element and the DTD to the "root", "ns", and "public" parameters.
It even gives an example: the common image/svg+xml type is approximately 
equivalent to
        application/prs.implied-document+xml;root=svg;ns="http://www.w3.org/2000/svg";public="-//W3C//DTD SVG 1.1//EN"
If something somewhere in the pipeline could express a media type like this, 
then the canonical image/svg+xml could be substituted as an alias somewhere.

        ◦ A separate question is how to define the "canonical media types" 
that correspond to a given XML document type. I was pleased to discover that 
the shared-mime-info database specification, commonly used on GNU/Linux, 
already provides for this! 
https://specifications.freedesktop.org/shared-mime-info/latest/ar01s02.html#id-1.3.9
As a matter of fact, on my Debian Trixie system, a file 
/usr/share/mime/XMLnamespaces already exists. This is a short plain text file 
with lines such as
> http://www.abisource.com/awml.dtd abiword application/x-abiword
> http://www.w3.org/1998/Math/MathML math application/mathml+xml
> http://www.w3.org/1999/xhtml html application/xhtml+xml
and so on. So regardless of whether the application/prs.implied-document+xml 
media type is used somewhere as an internal representation, this is a 
straightforward mapping that provides everything needed. If only Apache could 
use it.
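
To make the idea concrete, here is a rough sketch in Python (not an Apache 
module, and not libxml2) of the three steps described above: read the root 
element, namespace, and PUBLIC identifier; express them as an 
application/prs.implied-document+xml type; and look up a canonical type in 
XMLnamespaces-style lines. The function names are my own invention, the 
parameter quoting is simplified, and I am assuming each XMLnamespaces line is 
three space-separated fields, as in the excerpt above.

```python
import xml.parsers.expat


class _RootFound(Exception):
    """Raised from the handler to stop parsing once the root element is seen."""


def sniff_xml(data: bytes):
    """Return (namespace, root_local_name, public_id); missing parts are None.

    Malformed input raises xml.parsers.expat.ExpatError.
    """
    info = {"ns": None, "root": None, "public": None}
    # With a namespace separator, expat reports names as "namespaceURI local".
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["public"] = public_id

    def on_start(name, attrs):
        ns, _, local = name.rpartition(" ")
        info["ns"] = ns or None
        info["root"] = local
        raise _RootFound  # nothing after the root start tag is needed

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    try:
        parser.Parse(data, True)
    except _RootFound:
        pass
    return info["ns"], info["root"], info["public"]


def implied_media_type(ns, root, public):
    """Build the generic media type from the sniffed parts.

    Assumption: values are inserted as-is; real code would need to quote
    and escape parameter values properly.
    """
    params = [f"root={root}"]
    if ns:
        params.append(f'ns="{ns}"')
    if public:
        params.append(f'public="{public}"')
    return "application/prs.implied-document+xml;" + ";".join(params)


def load_xmlnamespaces(lines):
    """Parse 'namespaceURI localName mediaType' lines into a lookup table."""
    table = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) == 3:
            ns, local, media_type = parts
            table[(ns, local)] = media_type
    return table


def lookup_media_type(table, ns, root):
    """Return the canonical media type for (ns, root), or None if unknown."""
    return table.get((ns, root))
```

In practice one would pass the open /usr/share/mime/XMLnamespaces file to 
load_xmlnamespaces(), call sniff_xml() on the file being served, and fall back 
to application/xml (or to the implied-document form) when lookup_media_type() 
returns None.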

If any solutions exist along these lines, I don't know of them yet but would 
love to. Otherwise, I ask of the sympathetic readership: how do you handle this?
