On Thu Dec 1 20:53:17 2005, Christian Rose wrote:
On 12/1/05, Kevin Krammer <[EMAIL PROTECTED]> wrote:
> Isn't an XML file considered to be in ASCII unless a different encoding is
> specified by the processing instruction?
Not really. Unless other information is given, AFAIK an XML file is to
be assumed to be in UTF-8.
Quote from http://www.w3.org/TR/REC-xml/#charencoding :
"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME),
Right. So what you're then asking is, "does HTTP or MIME provide a
default?", because that is still information.
RFC3023 implies that the default for MIME is US-ASCII, and explicitly
defines the default for HTTP as US-ASCII, overriding HTTP's usual
default (for text/*) of ISO-8859-1. I say "implies" a default for
MIME (thus email), because I don't actually see a specified default
in RFC2046 for anything except text/plain, but RFC3023 appears to
reference that default. (It's late, I might well have missed
RFC2046's default, but I did look reasonably hard, as I wanted to
quote the text.)
So Kevin's right, provided you had the opportunity for a charset
parameter but didn't get one. If you had no such opportunity, then
REC-xml takes over.
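
A rough sketch of that rule for text/xml, assuming the reading of RFC3023 given above. The function names are made up for illustration, and returning None stands for "no transport information, so defer to REC-xml":

```python
from email.message import Message


def charset_param(content_type):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    value = msg.get_param("charset")
    # get_param may return a tuple for RFC 2231 encoded values; this sketch
    # only handles the plain string case.
    return value if isinstance(value, str) and value else None


def text_xml_charset(content_type=None):
    """Pick the character set for a text/xml entity.

    content_type is the full Content-Type value if a transport (HTTP, MIME)
    supplied one, or None if there was no such opportunity.
    """
    if content_type is None:
        return None  # no transport information: REC-xml takes over
    explicit = charset_param(content_type)
    if explicit is not None:
        return explicit.lower()
    # A content-type was available but carried no charset parameter.
    return "us-ascii"
```

Calling text_xml_charset("text/xml") gives the US-ASCII default, while text_xml_charset() with no argument leaves the decision to the XML rules.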
As a consequence, a file containing only ASCII characters but no
encoding information would be valid XML. But *assuming* that any file
without encoding information will be valid ASCII is plain wrong. Valid
ASCII is always valid UTF-8, but not necessarily the other way around.
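
The asymmetry is easy to check. A small illustration, where the sample documents and the helper name are invented for the example:

```python
def decodes_as(data, encoding):
    """Return True if the byte string is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A document using only ASCII characters, and one with a non-ASCII character.
ascii_doc = "<doc>plain</doc>".encode("ascii")
utf8_doc = "<doc>caf\u00e9</doc>".encode("utf-8")

ascii_is_utf8 = decodes_as(ascii_doc, "utf-8")   # ASCII bytes read as UTF-8
utf8_is_ascii = decodes_as(utf8_doc, "ascii")    # UTF-8 bytes read as ASCII
```

The first check succeeds and the second fails, because the encoded e-acute uses bytes above 0x7F, which ASCII does not define.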
Yes, this is true. The problem is that it only holds for a file held
on a simple filesystem with no ability to provide a content-type. If
you *do* have a MIME content-type field, then by default you have
US-ASCII, since the optional charset parameter still implies that
default even when absent.
In other words, a file on a traditional filesystem which indicates
(via extension, etc) text/xml has to be treated using REC-xml Section
4.3.3, which you quoted, but one retrieved via a VFS that supplies a
content-type has to be assumed to be US-ASCII.
Rejoice, because this is better than text/plain, which changes
default character sets depending on whether you got it from email,
web, or local disk.
But wait, because it's about to go horribly wrong. :-)
The type system that most desktops use, whether using the
freedesktop.org specification or not, uses only media types, not the
full content-type. So does this mean that we're really using MIME on
a local filesystem (we get a media-type, after all, so we assume all
optional parameters are absent), or does this really mean it isn't
MIME, and merely shares a subset of the syntax? The answer in turn
changes the default character set for text/xml, depending on your
reading of RFC2046 and RFC3023.
application/xml is joyously unaffected by this - if no character set
is specified, then you fall back to Section 4.3.3 of REC-xml, however
you got it, which says that you either have a BOM, or use UTF-8, or
provide a (presumably ASCII compatible) encoding declaration. (So not
*quite* UTF-8 as a default).
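
A simplified sketch of that fallback, to show why the default is not quite UTF-8. It is an assumption-laden illustration and not the full autodetection that REC-xml describes: it ignores UTF-32 and declarations written in non-ASCII-compatible encodings, and the function name is invented here.

```python
import codecs
import re

# Matches an encoding declaration at the very start of an ASCII-compatible
# document, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data):
    """Guess the encoding of an XML byte string with no external charset."""
    # 1. A byte order mark settles it.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # 2. Otherwise look for an encoding declaration readable as ASCII.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # 3. No BOM and no declaration: UTF-8.
    return "utf-8"
```

Only the last branch is the plain "assume UTF-8" case; the first two are the ways a document can say otherwise.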
Dave.
--
You see things; and you say "Why?"
But I dream things that never were; and I say "Why not?"
- George Bernard Shaw
_______________________________________________
xdg mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/xdg