Re: xml documents with multiple encoded text

David Bertoni Fri, 09 Nov 2007 11:24:50 -0800

Devanathan, Ramkumar (HP Software BTO) wrote:

hi,
i have a tough situation wherein the xml document that i need to transform is all utf-16 (encoding specified in the xml declaration is also utf-16) but 1 particular element in the xml field has content that's actually utf-8. when i view this document in wordpad the utf-8 content appears as 'boxes' (basically junk).

The only way Xerces-C would be able to parse this document is if those UTF-8 octets are such that it could successfully interpret them as part of the UTF-16 stream. If it is doing that, the original content of the element was mangled.
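
To make the mangling concrete, here is a minimal sketch of what happens when UTF-8 octets are read as UTF-16. It is not Xerces-C code: the names pairAsUtf16LE and demoPairAsUtf16LE are made up for illustration, and little-endian byte order is an assumption (the real order is whatever the document's byte order mark says). Each pair of octets becomes one code unit, so the two ASCII letters "hi" collapse into a single unrelated character.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: pair raw octets into UTF-16 code units,
// little-endian (low byte first). A trailing unpaired octet is dropped.
std::vector<std::uint16_t> pairAsUtf16LE(const std::string& octets)
{
    std::vector<std::uint16_t> units;
    for (std::size_t i = 0; i + 1 < octets.size(); i += 2) {
        const auto lo = static_cast<unsigned char>(octets[i]);
        const auto hi = static_cast<unsigned char>(octets[i + 1]);
        units.push_back(static_cast<std::uint16_t>(lo | (hi << 8)));
    }
    return units;
}

// Usage sketch: call this from your own main() or test harness.
void demoPairAsUtf16LE()
{
    // 'h' is 0x68 and 'i' is 0x69 in UTF-8 (and ASCII).
    const std::vector<std::uint16_t> units = pairAsUtf16LE("hi");
    assert(units.size() == 1);
    assert(units[0] == 0x6968);  // one CJK ideograph instead of "hi"
}
```

The parse can succeed in a case like this, but the two original characters are gone from the source tree, which is the sense in which the content is mangled.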

knowing that this content is always utf-8, can i still attempt a transformation - does xalan allow for such a scenario - if so, how?

No, because Xerces-C cannot parse such a document the way you want it to.

as of now, the transformation renders the text as an empty string - this is with Xalan 1.10.

Xalan-C does not "render" text. It generates a well-formed external general parsed entity based on your stylesheet. If it's producing an empty string in the result tree, it's because your stylesheet is directing it to, or because that's what's in the source tree.

How do you propose that an XML parser figure out that a certain range of bytes in an external entity are in a different encoding? In fact, a parser cannot do that as it violates the well-formedness constraints of the XML recommendation:

"In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration."

"It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding."

The first time you have one of these "elements" with content that contains an odd number of bytes, the parser will stop because it will not be able to interpret the bytes of the entity correctly.
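
As a rough illustration of why such a stream cannot be decoded, the sketch below checks two of the conditions a UTF-16 decoder depends on: an even byte count, and correctly paired surrogates. It is not how Xerces-C is implemented, and a real parser enforces more than this. The names Utf16Check, checkUtf16LE and demoCheckUtf16LE are hypothetical, and little-endian byte order is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Utf16Check { Ok, OddByteCount, UnpairedSurrogate };

// Hypothetical helper: check whether a byte buffer can be legal
// little-endian UTF-16 at the code-unit level.
Utf16Check checkUtf16LE(const std::vector<std::uint8_t>& bytes)
{
    // UTF-16 code units are two bytes each, so an odd length cannot decode.
    if (bytes.size() % 2 != 0) return Utf16Check::OddByteCount;

    bool expectLow = false;  // true right after a high surrogate
    for (std::size_t i = 0; i < bytes.size(); i += 2) {
        const auto unit =
            static_cast<std::uint16_t>(bytes[i] | (bytes[i + 1] << 8));
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        const bool isLow  = unit >= 0xDC00 && unit <= 0xDFFF;

        if (expectLow) {
            if (!isLow) return Utf16Check::UnpairedSurrogate;
            expectLow = false;
        } else if (isLow) {
            return Utf16Check::UnpairedSurrogate;
        } else if (isHigh) {
            expectLow = true;
        }
    }
    return expectLow ? Utf16Check::UnpairedSurrogate : Utf16Check::Ok;
}

// Usage sketch: call this from your own main() or test harness.
void demoCheckUtf16LE()
{
    const std::vector<std::uint8_t> good = {0x68, 0x00, 0x69, 0x00};  // "hi"
    const std::vector<std::uint8_t> odd  = {0x68, 0x00, 0x69};
    const std::vector<std::uint8_t> lone = {0x00, 0xD8};  // high surrogate only
    assert(checkUtf16LE(good) == Utf16Check::Ok);
    assert(checkUtf16LE(odd)  == Utf16Check::OddByteCount);
    assert(checkUtf16LE(lone) == Utf16Check::UnpairedSurrogate);
}
```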

of course, the obvious solution to get the xml document fixed, is something that we could do, but that brings other complications for existing apps (java based) that have already factored this incongruence within their implementation. so this is a possible but last option.

Xerces-C and Xalan-C support the XML recommendations as they are described, not as how they are implemented by broken systems. It's beyond me why you would want to implement a system using XML then break it in such a way that you cannot use conforming tools.

The only hack I can think of that would work would be for you to implement a custom InputStream that reinterprets the bytes of the "document" such that you transcode any UTF-8 bytes into UTF-16 before the parser sees them.
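
The transcoding step at the heart of that hack might look something like the sketch below. It is standalone C++ with no Xerces-C calls; utf8ToUtf16 and demoUtf8ToUtf16 are hypothetical names. It assumes you already know exactly which octets in the stream are the UTF-8 run, which is the hard part and is not shown. In Xerces-C the stream abstraction is BinInputStream, handed to the parser through an InputSource, so a real implementation would do this work inside its own BinInputStream subclass and emit the code units in the document's byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: transcode a run of UTF-8 octets into UTF-16 code units.
// Returns false on malformed input; `out` may then be partially filled.
bool utf8ToUtf16(const std::string& in, std::u16string& out)
{
    std::size_t i = 0;
    while (i < in.size()) {
        const auto b0 = static_cast<std::uint8_t>(in[i]);
        std::uint32_t cp = 0;     // code point being assembled
        std::size_t extra = 0;    // number of continuation bytes expected
        std::uint32_t minCp = 0;  // smallest value legal for this length

        if (b0 < 0x80)                { cp = b0;        extra = 0; minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; minCp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (in.size() - i <= extra) return false;  // sequence is truncated

        for (std::size_t k = 1; k <= extra; ++k) {
            const auto b = static_cast<std::uint8_t>(in[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }

        // Reject overlong forms, surrogate code points, and out-of-range values.
        if (cp < minCp || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // encode as a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        i += extra + 1;
    }
    return true;
}

// Usage sketch: call this from your own main() or test harness.
void demoUtf8ToUtf16()
{
    std::u16string out;
    const bool ok = utf8ToUtf16("h\xC3\xA9", out);  // "h" followed by U+00E9
    (void)ok;
    assert(ok);
    assert(out.size() == 2);
    assert(out[0] == 0x0068 && out[1] == 0x00E9);
}
```

Even with this in place, the stream still has to decide where each UTF-8 run starts and ends without help from the parser, so the approach is only workable if that element can be located reliably in the raw bytes.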


Dave
