On Monday, 12/30/2002 at 10:52 PST, rf <[EMAIL PROTECTED]> wrote:
> How does an XML reader know of the encoding of an XML
> before reading it

The simple answer is, it can't. It needs to examine the XML Declaration
*before* selecting the encoding. The XML Recommendation discusses this
process:

1. Look for the byte-order mark, which may or may not be present.
2. Look for the start of the XML Declaration. Since we know it starts
   with "<?", we can usually recognize the general family of encodings
   (UTF-8-like, UTF-16-like, EBCDIC-like, etc.) from those first few bytes.
3. Use that information to interpret the rest of the XML Declaration.
4. If an encoding was specified, read the rest of the document using
   that encoding.
5. If it wasn't specified, you can/should usually assume it's UTF-8 or
   UTF-16.

> One book
> says that if you are reading an XML accross a
> network(say http), then you (have to) mention the
> encoding in the MIME type header.

This is highly encouraged, since switching encodings after you've started
reading the stream tends to be less efficient. But a correctly implemented
parser *ought* to be able to handle the cases where the encoding is
specified only by the file.

> reading files from the disk - whats the answer?

If it isn't specified in the XML Declaration, the data should be read as
UTF-8 or UTF-16. Some parsers may attempt to guess non-UTF encodings if you
haven't specified the encoding, but that guess is unreliable and you
shouldn't depend on it.

______________________________________
Joe Kesselman / IBM Research
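
P.S. The detection steps described in the reply above can be sketched roughly as follows. This is only an illustration in Python, not how any particular parser does it. The function names (sniff_family, detect_xml_encoding), the 200-byte lookahead, and the use of cp037 as the stand-in EBCDIC code page are all assumptions made for the example. A real parser would also check that the declared encoding is consistent with the sniffed family.

```python
import codecs
import re

# Matches an XML Declaration that carries an encoding pseudo-attribute.
_DECL = re.compile(
    r"""<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_family(head: bytes) -> str:
    """Guess a provisional codec from the first few bytes (hypothetical helper)."""
    # UTF-32 BOMs are checked first: the LE one begins with the UTF-16 LE BOM.
    if head.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # No BOM: look at how "<?" is laid out in the bytes.
    if head.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if head.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if head.startswith(b"\x4c\x6f\xa7\x94"):  # "<?xm" in EBCDIC
        return "cp037"  # assumption: one EBCDIC code page stands in for the family
    # ASCII-compatible family; latin-1 decodes any byte without raising.
    return "latin-1"


def detect_xml_encoding(data: bytes) -> str:
    """Return the encoding name to use for the rest of the document."""
    head = data[:200]  # assumption: the declaration fits in the first 200 bytes
    family = sniff_family(head)
    text = head.decode(family, errors="replace")
    match = _DECL.match(text)
    if match:
        return match.group(1)  # the declaration names the encoding
    # Nothing declared: fall back to UTF-8, or keep the sniffed UTF-16/32 family.
    return "utf-8" if family == "latin-1" else family
```

For example, detect_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>') gives back the declared name, while a bare b'<a/>' falls through to the UTF-8 default.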
