Thanks for the inputs, Dave. We will try this out and get back to the list.

Ramkumar Devanathan
HP Software BTO R&D
Bangalore India
PHONE: (91)-80-251-67214

-----Original Message-----
From: David Bertoni [mailto:[EMAIL PROTECTED]
Sent: Saturday, November 10, 2007 12:55 AM
To: xalan-c-users@xml.apache.org
Subject: Re: xml documents with multiple encoded text

Devanathan, Ramkumar (HP Software BTO) wrote:
> hi,
>
> i have a tough situation wherein the xml document that i need to
> transform is all utf-16 (encoding specified in the xml declaration is
> also utf-16) but 1 particular element in the xml field has content
> that's actually utf-8. when i view this document in wordpad the utf-8
> content appears as 'boxes' (basically junk).

The only way Xerces-C would be able to parse this document is if those
UTF-8 octets are such that it could successfully interpret them as part
of the UTF-16 stream. If it is doing that, the original content of the
element was mangled.

>
> knowing that this content is always utf-8, can i still attempt a
> transformation - does xalan allow for such a scenario - if so, how?

No, because Xerces-C cannot parse such a document the way you want it to.

>
> as of now, the transformation renders the text as an empty string - this
> is with Xalan 1.10.

Xalan-C does not "render" text. It generates a well-formed external
general parsed entity based on your stylesheet. If it's producing an
empty string in the result tree, it's because your stylesheet is
directing it to, or because that's what's in the source tree.

How do you propose that an XML parser figure out that a certain range of
bytes in an external entity are in a different encoding? In fact, a
parser cannot do that as it violates the well-formedness constraints of
the XML recommendation:

"In the absence of information provided by an external transport
protocol (e.g. HTTP or MIME), it is a fatal error for an entity
including an encoding declaration to be presented to the XML processor
in an encoding other than that named in the declaration, or for an
entity which begins with neither a Byte Order Mark nor an encoding
declaration to use an encoding other than UTF-8. Note that since ASCII
is a subset of UTF-8, ordinary ASCII entities do not strictly need an
encoding declaration."

"It is a fatal error when an XML processor encounters an entity with an
encoding that it is unable to process. It is a fatal error if an XML
entity is determined (via default, encoding declaration, or higher-level
protocol) to be in a certain encoding but contains byte sequences that
are not legal in that encoding."

The first time you have one of these "elements" with content that
contains an odd number of bytes, the parser will stop because it will
not be able to interpret the bytes of the entity correctly.

>
> of course, the obvious solution to get the xml document fixed, is
> something that we could do, but that brings other complications for
> existing apps (java based) that have already factored this incongruence
> within their implementation. so this is a possible but last option.

Xerces-C and Xalan-C support the XML recommendations as they are
described, not as how they are implemented by broken systems. It's
beyond me why you would want to implement a system using XML then break
it in such a way that you cannot use conforming tools.

The only hack I can think of that would work would be for you to
implement a custom InputStream that reinterprets the bytes of the
"document" such that you transcode any UTF-8 bytes into UTF-16 before
the parser sees them.

Dave
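
For anyone following the thread, here is a rough sketch of the transcoding step behind the hack described above. It is not code from the thread and not a Xerces-C API: the function names (utf8ToUtf16le, asciiToUtf16le, repairMixedDocument) are hypothetical. It assumes the document is UTF-16 little-endian, that the offending element has an ASCII name and no attributes, and that its content is raw UTF-8 bytes spliced into the byte stream. If the real documents differ, the search logic would need to change. Feeding the repaired bytes to the parser (for example from a custom input stream or an in-memory buffer) is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decode UTF-8 bytes and re-encode them as UTF-16LE bytes.
// Minimal decoder: rejects truncated or malformed sequences, but does
// not check for overlong forms or encoded surrogates.
std::string utf8ToUtf16le(const std::string& in) {
    std::string out;
    auto appendUnit = [&out](std::uint32_t u) {
        out.push_back(static_cast<char>(u & 0xFF));         // low byte first
        out.push_back(static_cast<char>((u >> 8) & 0xFF));  // then high byte
    };
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");
        if (i + extra >= in.size())
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;
        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            appendUnit(0xD800 + (cp >> 10));
            appendUnit(0xDC00 + (cp & 0x3FF));
        } else {
            appendUnit(cp);
        }
    }
    return out;
}

// Encode an ASCII string as UTF-16LE bytes (used to build tag patterns).
std::string asciiToUtf16le(const std::string& s) {
    std::string out;
    for (char c : s) { out.push_back(c); out.push_back('\0'); }
    return out;
}

// Copy `doc`, transcoding the bytes between each <element> and </element>
// from UTF-8 to UTF-16LE. The search is byte-wise on purpose: after an
// odd-length UTF-8 run, later tags no longer sit on 2-byte boundaries.
// Assumption: the UTF-8 content contains no NUL bytes, so it cannot
// accidentally match the UTF-16LE end-tag pattern.
std::string repairMixedDocument(const std::string& doc,
                                const std::string& element) {
    const std::string open = asciiToUtf16le("<" + element + ">");
    const std::string close = asciiToUtf16le("</" + element + ">");
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        std::size_t start = doc.find(open, pos);
        if (start == std::string::npos) break;
        start += open.size();
        const std::size_t end = doc.find(close, start);
        if (end == std::string::npos) break;
        out.append(doc, pos, start - pos);                     // up to and including the start tag
        out += utf8ToUtf16le(doc.substr(start, end - start));  // repaired content
        pos = end;                                             // resume at the end tag
    }
    out.append(doc, pos, std::string::npos);  // remainder, unchanged
    return out;
}
```

Calling repairMixedDocument(bytes, "note") on the raw file contents, where "note" stands in for whatever the real element is called, should yield a buffer that is UTF-16LE throughout. Note that this only helps if the UTF-8 bytes reached the file intact; if they were already mangled upstream, as discussed above, no transcoding step can recover them.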