Re: Encoding and mystery

Joseph Kesselman 3 Jan 2003 02:45:57 -0000

On Monday, 12/30/2002 at 10:52 PST, rf <[EMAIL PROTECTED]> wrote:
> How does an XML reader know of the encoding of an XML
> before reading it


The simple answer is, it can't. It needs to examine the XML Declaration 
*before* selecting the encoding. The XML Recommendation discusses this 
process:

Look for the byte-order mark, which may or may not be present.

Look for the start of the XML Declaration. Since we know it starts with 
"<?", we can usually recognize the general family of encodings 
(UTF-8-like, UTF-16-like, EBCDIC-like, etc.) from those first few bytes.

Use that information to interpret the rest of the XML Declaration. If an 
encoding was specified, read the rest of the document using that encoding. 
If it wasn't specified, you can/should usually assume it's UTF-8 or 
UTF-16.
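
The three steps above can be sketched roughly as follows. This is only an 
illustration in Python, not how any particular parser does it: the function 
name sniff_xml_encoding, the 512-byte window, and the choice of cp037 as the 
stand-in EBCDIC code page are all my own assumptions, and it returns whatever 
name the declaration gives without checking that the name is supported.

```python
import codecs
import re

# Step 1: byte-order marks. UTF-32 comes before UTF-16 because the
# UTF-32-LE mark (FF FE 00 00) begins with the UTF-16-LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Step 2: what the start of "<?xml" looks like in each encoding family.
# "latin-1" is used here only as a provisional decoder for the
# ASCII-compatible family; cp037 is an assumed EBCDIC code page.
_FAMILIES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"\x4c\x6f\xa7\x94", "cp037"),
    (b"<?xm", "latin-1"),
]

_ENCODING_RE = re.compile(r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    head = data[:512]
    family = None
    for bom, name in _BOMS:
        if head.startswith(bom):
            family, head = name, head[len(bom):]
            break
    if family is None:
        for pattern, name in _FAMILIES:
            if head.startswith(pattern):
                family = name
                break
    if family is None:
        # No byte-order mark and no recognizable declaration.
        return "utf-8"

    # Step 3: decode just enough to read the declaration itself.
    text = head.decode(family, errors="replace")
    end = text.find("?>")
    if text.startswith("<?xml") and end != -1:
        match = _ENCODING_RE.search(text[:end])
        if match:
            return match.group(1)
    # Nothing declared: fall back to the detected family, or UTF-8 for
    # the ASCII-compatible case.
    return "utf-8" if family == "latin-1" else family
```

For example, sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>') 
returns "ISO-8859-1", and a document with no declaration and no byte-order 
mark comes back as "utf-8".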

> One book
> says that if you are reading an XML across a
> network(say http), then you (have to) mention the
> encoding in the MIME type header.

This is highly encouraged, since switching encodings after you've started 
reading the stream tends to be less efficient. But a correctly-implemented 
parser *ought* to be able to handle the cases where the encoding is 
specified only by the file.
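
As a rough illustration of the HTTP case, here is a small Python sketch that 
pulls the charset parameter out of a Content-Type header value using the 
standard email.message module. The function names and the precedence shown 
(header first, then whatever the document itself declares, then UTF-8) are my 
assumptions for the example, not something every parser is guaranteed to do.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the name and returns None when
    # the header carries no charset parameter.
    return msg.get_content_charset()


def pick_encoding(content_type, declared_in_document=None):
    """Assumed precedence: transport header, then document, then UTF-8."""
    return (
        charset_from_content_type(content_type)
        or declared_in_document
        or "utf-8"
    )
```

With a header of 'text/xml; charset="ISO-8859-1"' the first function returns 
"iso-8859-1"; with a bare "application/xml" it returns None and the caller 
has to fall back on what the document says.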

> reading files from the disk - what's the answer?

If it isn't specified in the XML Declaration, the data should be read as 
UTF-8 or UTF-16. Some parsers may attempt to guess non-UTF encodings if 
you haven't specified the encoding, but that is guesswork and shouldn't 
be relied upon.
______________________________________
Joe Kesselman  / IBM Research

