Actually, I mistyped that slightly. I didn't mean to say it's just for LE. If
you force the encoding to anything, we will not check for a BOM at all,
because we don't assume we know what that forced encoding means. We could
special-case the UTF-16 variants that we know of and skip a BOM if we
find one, but thus far the thinking has been that if you are telling us
what the encoding is, then you've figured it out by looking at things like
the BOM, and will have discarded it before you give us the data to play
with.
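To make that concrete, here is a minimal sketch in Python (not Xerces-C code) of what the caller would be expected to do before handing over data with a forced encoding. The helper name strip_bom and the small table of encoding names are assumptions made up for illustration; the parser itself does none of this when the encoding is forced.

```python
import codecs

# Hypothetical table, for illustration only: the BOM each forced encoding
# might carry at the start of the byte stream.
_BOMS = {
    "utf-16le": codecs.BOM_UTF16_LE,
    "utf-16be": codecs.BOM_UTF16_BE,
    "utf-8": codecs.BOM_UTF8,
}


def strip_bom(data: bytes, forced_encoding: str) -> bytes:
    """Return data without a leading BOM that matches forced_encoding.

    Unknown encodings, and BOMs that do not match the forced encoding,
    are left untouched.
    """
    bom = _BOMS.get(forced_encoding.lower())
    if bom and data.startswith(bom):
        return data[len(bom):]
    return data
```

The caller would run something like strip_bom(raw_bytes, "UTF-16LE") first and only then pass the result, along with the forced encoding, to the parser.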
----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]
Tim Bray <[EMAIL PROTECTED]> on 01/27/2000 12:38:24 PM
Please respond to [EMAIL PROTECTED]
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
cc:
Subject: Re: Xerces-C Tech Talk: Input Sources
Dean Roddey wrote:
>Be aware that, as the code stands right now, if you force the encoding on
>an entity, all internal smarts about the encoding are skipped. So, if you
>force the encoding to UTF-16LE, and there is a BOM, the parser won't try to
>skip it, and the parse will fail.
Yech. Barf. One practical consequence is that in most cases, it will
be a bad idea to try to override. And the case of UTF-16LE (or BE) is
particularly troubled. There was passionate debate because some people
over in IETF wanted to make the BOM *forbidden* for UTF-16LE and BE; other
people felt that it could not possibly ever be a bad idea to put a BOM on
any UTF-16, and thus this would mean that the LE and BE variants,
practically speaking, couldn't be used as media-types for XML. Don't know
how that one eventually settled out.
Having said that, you're probably doing the right thing. There is
a use-case, not sure how strong: some webserver out there does transcoding,
say from EUC to Shift-JIS (I'm told this actually happens), without of
course fixing up the XML declaration, so you get an XML declaration that
is actually wrong and will probably cause your parse to crash & burn
unless ignored. Of course you can get around this by using application/xml
as the media type (no transcoding allowed) or, even better, by not
transcoding in the server.
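A rough illustration of that failure mode, as a Python sketch rather than anything a real server or Xerces-C does: the bytes get re-encoded from EUC-JP to Shift-JIS while the declaration still says EUC-JP. The names transcode and decodes_correctly and the sample document are made up for the example.

```python
# Sample document; the escapes spell three kanji so this source stays ASCII.
TEXT = '<?xml version="1.0" encoding="EUC-JP"?><t>\u65e5\u672c\u8a9e</t>'


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Mimic a server that re-encodes the bytes but leaves the XML declaration alone."""
    return data.decode(src).encode(dst)


def decodes_correctly(data: bytes, encoding: str, expected: str) -> bool:
    """True only if data decodes cleanly under encoding and yields the expected text."""
    try:
        return data.decode(encoding) == expected
    except UnicodeDecodeError:
        return False


original = TEXT.encode("euc_jp")
served = transcode(original, "euc_jp", "shift_jis")

# Trusting the (now wrong) declaration either fails or garbles the text;
# overriding it with the real encoding recovers the document.
trust_declaration = decodes_correctly(served, "euc_jp", TEXT)
force_override = decodes_correctly(served, "shift_jis", TEXT)
```

This is the one situation where ignoring the declaration and forcing the encoding is the only way to get the original text back.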
In the general case, the best thing to do is to leave the parser
alone. With luck, the need for this escape hatch will be relatively
short-lived. -T.