Devanathan, Ramkumar (HP Software BTO) wrote:
hi,
i have a tough situation wherein the xml document that i need to
transform is all utf-16 (encoding specified in the xml declaration is
also utf-16) but 1 particular element in the xml field has content
that's actually utf-8. when i view this document in wordpad the utf-8
content appears as 'boxes' (basically junk).
The only way Xerces-C would be able to parse this document is if those
UTF-8 octets are such that it could successfully interpret them as part of
the UTF-16 stream. If it is doing that, the original content of the
element was mangled.
knowing that this content is always utf-8, can i still attempt a
transformation - does xalan allow for such a scenario - if so, how?
No, because Xerces-C cannot parse such a document the way you want it to.
as of now, the transformation renders the text as an empty string - this
is with Xalan 1.10.
Xalan-C does not "render" text. It generates a well-formed external
general parsed entity based on your stylesheet. If it's producing an empty
string in the result tree, it's because your stylesheet is directing it to, or
because that's what's in the source tree.
How do you propose that an XML parser figure out that a certain range of
bytes in an external entity are in a different encoding? In fact, a parser
cannot do that as it violates the well-formedness constraints of the XML
recommendation:
"In the absence of information provided by an external transport protocol
(e.g. HTTP or MIME), it is a fatal error for an entity including an
encoding declaration to be presented to the XML processor in an encoding
other than that named in the declaration, or for an entity which begins
with neither a Byte Order Mark nor an encoding declaration to use an
encoding other than UTF-8. Note that since ASCII is a subset of UTF-8,
ordinary ASCII entities do not strictly need an encoding declaration."
"It is a fatal error when an XML processor encounters an entity with an
encoding that it is unable to process. It is a fatal error if an XML entity
is determined (via default, encoding declaration, or higher-level protocol)
to be in a certain encoding but contains byte sequences that are not legal
in that encoding."
The first time you have one of these "elements" with content that contains
an odd number of bytes, the parser will stop because it will not be able to
interpret the bytes of the entity correctly.
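
To make the odd-byte problem concrete, here is a minimal sketch of what a decoder does when it trusts a UTF-16 declaration: it pairs raw bytes into 16-bit code units. The names `Utf16Units` and `pairAsUtf16LE` are made up for illustration, and little-endian byte order is an assumption; this is not how Xerces-C is implemented internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper type: the result of pairing raw bytes into code units.
struct Utf16Units {
    std::vector<std::uint16_t> units; // complete 16-bit code units
    bool danglingByte;                // true if one byte was left unpaired
};

// Pair bytes into UTF-16 code units, assuming little-endian order, the way
// a decoder that believes the encoding declaration would.
Utf16Units pairAsUtf16LE(const std::string& bytes) {
    Utf16Units result{{}, (bytes.size() % 2) != 0};
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(bytes[i]);
        const auto hi = static_cast<unsigned char>(bytes[i + 1]);
        result.units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return result;
}
```

Feeding it the three UTF-8 bytes `C3 A9 21` (the text "é!") yields the single code unit 0xA9C3, which is not what was written, plus one unpaired byte. In a real document that leftover byte would be paired with the first byte of the following markup, so everything after it is misread.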
of course, the obvious solution to get the xml document fixed, is
something that we could do, but that brings other complications for
existing apps (java based) that have already factored this incongruence
within their implementation. so this is a possible but last option.
Xerces-C and Xalan-C support the XML recommendations as they are described,
not as they are implemented by broken systems. It's beyond me why you
would want to implement a system using XML then break it in such a way that
you cannot use conforming tools.
The only hack I can think of that would work would be for you to implement
a custom InputStream that reinterprets the bytes of the "document" such
that you transcode any UTF-8 bytes into UTF-16 before the parser sees them.
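
The transcoding step of that hack might look roughly like the sketch below. The function name `utf8ToUtf16` is made up for illustration. It covers only the conversion itself, and it assumes you have already found where the UTF-8 bytes start and end in the stream, which is the hard part and depends entirely on your data. Wiring it into the parser would presumably mean supplying your own `InputSource` and `BinInputStream` subclasses; that part is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Convert a UTF-8 byte string to UTF-16 code units.
// Throws std::invalid_argument on malformed input.
std::u16string utf8ToUtf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;    // code point being assembled
        std::size_t extra = 0;   // number of continuation bytes expected
        std::uint32_t minCp = 0; // smallest value allowed for this length
        if (b0 < 0x80)                { cp = b0;        extra = 0; minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; minCp = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        // Reject overlong forms, surrogate code points, and values past U+10FFFF.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range UTF-8 sequence");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The code units it returns still have to be written back out in the byte order the rest of the document uses before the parser sees them.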
Dave