On Thu Dec  1 20:53:17 2005, Christian Rose wrote:
On 12/1/05, Kevin Krammer <[EMAIL PROTECTED]> wrote:
> Isn't an XML file considered to be in ASCII unless a different encoding is
> specified by the processing instruction?

Not really. Unless other information is given, AFAIK an XML file is to
be assumed to be in UTF-8.
Quote from http://www.w3.org/TR/REC-xml/#charencoding :

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME),

Right. So what you're then asking is, "does HTTP or MIME provide a default?", because that is still information.

RFC3023 implies that the default for text/xml over MIME is US-ASCII, and explicitly defines the default for HTTP as US-ASCII, overriding HTTP's usual default (for text/*) of ISO-8859-1. I say "implies" a default for MIME (thus email), because I don't actually see a specified default in RFC2046 for anything except text/plain, but RFC3023 appears to reference that default. (It's late, I might well have missed RFC2046's default, but I did look reasonably hard, as I wanted to quote the text.)

So Kevin's right, provided you had the opportunity for a charset parameter but didn't get one. If you never had that opportunity, then REC-xml takes over.

As a consequence, a file containing only ASCII characters but no
encoding information would be valid XML. But *assuming* that any file without encoding information will be valid ASCII is plain wrong. Valid ASCII is always valid UTF-8, but not necessarily the other way around.
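
The asymmetry is easy to check. A minimal Python sketch, where the two sample documents and the helper name decodes_as are made up purely for illustration:

```python
# ASCII is a strict subset of UTF-8: every ASCII byte sequence decodes
# identically as UTF-8, but UTF-8 text may contain bytes >= 0x80.
ascii_bytes = "<doc>hello</doc>".encode("ascii")
utf8_bytes = "<doc>h\u00e9llo</doc>".encode("utf-8")  # contains U+00E9


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if data is a valid byte sequence in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Valid ASCII is always valid UTF-8, with the same result.
ascii_is_utf8 = decodes_as(ascii_bytes, "utf-8")
# Valid UTF-8 is not necessarily valid ASCII.
utf8_is_ascii = decodes_as(utf8_bytes, "ascii")
```

So a parser that assumes US-ASCII would reject the second document, while one that assumes UTF-8 accepts both.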

Yes, this is true. The problem is that it only holds for a file held on a simple filesystem with no ability to provide a content-type. If you *do* have a MIME content-type field, then by default you have US-ASCII, since the optional charset parameter still tells you that even when absent.

In other words, a file on a traditional filesystem which indicates (via extension, etc) text/xml has to be treated using REC-xml Section 4.3.3, which you quoted, but one retrieved via a VFS has to be assumed to be US-ASCII.
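
As a rough sketch of that rule in Python, under my reading of RFC3023 as described above: the function name text_xml_charset is hypothetical, and returning None is just this sketch's way of saying "no MIME information, fall back to REC-xml Section 4.3.3". Only the standard library's email.message.Message is used to parse the header.

```python
from email.message import Message
from typing import Optional


def text_xml_charset(content_type_header: Optional[str]) -> Optional[str]:
    """Pick the charset for a text/xml entity (hypothetical helper).

    None means there was no Content-Type at all, so the caller should
    fall back to REC-xml Section 4.3.3 instead of a MIME default.
    """
    if content_type_header is None:
        return None  # plain filesystem: no MIME information available

    msg = Message()
    msg["Content-Type"] = content_type_header
    if msg.get_content_type() != "text/xml":
        raise ValueError("this sketch only covers text/xml")

    # An explicit charset parameter wins; an absent one still "tells"
    # us US-ASCII, which is the default assumed in this thread.
    return msg.get_content_charset() or "us-ascii"
```

The point of the three-way result is that "no charset parameter" and "no content-type at all" are different situations with different defaults.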

Rejoice, because this is better than text/plain, which changes default character sets depending on whether you got it from email, web, or local disk.

But wait, because it's about to go horribly wrong. :-)

The type system that most desktops use, whether using the freedesktop.org specification or not, uses only media types, not the full content-type. So does this mean that we're really using MIME on a local filesystem (we get a media-type, after all, so we assume all optional parameters are absent), or does it mean that it isn't MIME at all, and merely shares a subset of the syntax? The answer in turn changes the default character set for text/xml, depending on your reading of RFC2046 and RFC3023.

application/xml is joyously unaffected by this - if no character set is specified, then you fall back to Section 4.3.3 of REC-xml, however you got it. That section says that you either have a BOM, or use UTF-8, or provide a (presumably ASCII compatible) encoding declaration. (So not *quite* UTF-8 as a default).
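
A deliberately simplified Python sketch of that fallback follows. The name sniff_xml_encoding and the regular expression are mine, not taken from REC-xml; it only looks at UTF-8 and UTF-16 BOMs (UTF-32 and EBCDIC are ignored), and it assumes any encoding declaration is readable as ASCII-compatible bytes.

```python
import codecs
import re

# Matches an encoding declaration at the very start of the document,
# assuming the declaration itself is in an ASCII-compatible encoding.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity with no external charset."""
    # 1. A byte order mark settles it.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. Otherwise honour an explicit encoding declaration.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. No BOM and no declaration: the document is taken to be UTF-8.
    return "utf-8"
```

For example, a document starting with `<?xml version="1.0" encoding="ISO-8859-1"?>` is reported as iso-8859-1, while one with neither a BOM nor a declaration is reported as utf-8.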

Dave.
--
          You see things; and you say "Why?"
  But I dream things that never were; and I say "Why not?"
   - George Bernard Shaw
_______________________________________________
xdg mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/xdg
