RE: xml documents with multiple encoded text

Devanathan, Ramkumar (HP Software BTO) Fri, 09 Nov 2007 23:26:48 -0800

Thanks for the inputs, Dave.

We will try this out and get back to the list.

Ramkumar Devanathan
HP Software BTO R&D
Bangalore
India
PHONE: (91)-80-251-67214

-----Original Message-----
From: David Bertoni [mailto:[EMAIL PROTECTED] 
Sent: Saturday, November 10, 2007 12:55 AM
To: xalan-c-users@xml.apache.org
Subject: Re: xml documents with multiple encoded text

Devanathan, Ramkumar (HP Software BTO) wrote:
> hi,
>  
> i have a tough situation wherein the xml document that i need to 
> transform is all utf-16 (encoding specified in the xml declaration is 
> also utf-16) but 1 particular element in the xml field has content 
> that's actually utf-8. when i view this document in wordpad the utf-8 
> content appears as 'boxes' (basically junk).
The only way Xerces-C would be able to parse this document is if those
UTF-8 octets are such that it could successfully interpret them as part of
the UTF-16 stream.  If it is doing that, the original content of the
element was mangled.

>  
> knowing that this content is always utf-8, can i still attempt a 
> transformation - does xalan allow for such a scenario - if so, how?
No, because Xerces-C cannot parse such a document the way you want it to.

>  
> as of now, the transformation renders the text as an empty string -
> this is with Xalan 1.10.
Xalan-C does not "render" text.  It generates a well-formed external
general parsed entity based on your stylesheet.  If it's producing an empty
string in the result tree, it's because your stylesheet is directing it to,
or because that's what's in the source tree.

How do you propose that an XML parser figure out that a certain range of
bytes in an external entity are in a different encoding?  In fact, a parser
cannot do that as it violates the well-formedness constraints of the XML
recommendation:

"In the absence of information provided by an external transport protocol
(e.g. HTTP or MIME), it is a fatal error for an entity including an
encoding declaration to be presented to the XML processor in an encoding
other than that named in the declaration, or for an entity which begins
with neither a Byte Order Mark nor an encoding declaration to use an
encoding other than UTF-8. Note that since ASCII is a subset of UTF-8,
ordinary ASCII entities do not strictly need an encoding declaration."

"It is a fatal error when an XML processor encounters an entity with an
encoding that it is unable to process. It is a fatal error if an XML entity
is determined (via default, encoding declaration, or higher-level protocol)
to be in a certain encoding but contains byte sequences that are not legal
in that encoding."

The first time you have one of these "elements" with content that contains
an odd number of bytes, the parser will stop because it will not be able to
interpret the bytes of the entity correctly.
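
To make the byte-level problem concrete, here is a small Python sketch. It
is not part of the original exchange, and the sample strings are made up
for illustration. It reads UTF-8 bytes as if they were UTF-16: an odd byte
count is rejected outright, and an even byte count decodes without error
but to unrelated characters.

```python
# Illustrative only: the sample text is hypothetical.
utf8_odd = "café".encode("utf-8")   # 5 bytes: 'é' takes two bytes in UTF-8

# An odd byte count cannot be split into 16-bit code units, so a strict
# UTF-16 decoder rejects the data.
try:
    utf8_odd.decode("utf-16-le")
    odd_decoded = True
except UnicodeDecodeError:
    odd_decoded = False

# An even byte count may decode "successfully", but every pair of UTF-8
# bytes is fused into one unrelated UTF-16 code unit.
utf8_even = "cafe".encode("utf-8")        # 4 bytes
mangled = utf8_even.decode("utf-16-le")   # 2 characters, neither of them Latin
```

Python's decoder stands in for the parser's transcoder here; the exact
error reporting in Xerces-C will differ, but the underlying constraint is
the same.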

>  
> of course, the obvious solution to get the xml document fixed, is 
> something that we could do, but that brings other complications for 
> existing apps (java based) that have already factored this incongruence
> within their implementation. so this is a possible but last option.
Xerces-C and Xalan-C support the XML recommendations as they are described,
not as they are implemented by broken systems.  It's beyond me why you
would want to implement a system using XML and then break it in such a way
that you cannot use conforming tools.

The only hack I can think of that would work would be for you to implement
a custom InputStream that reinterprets the bytes of the "document" such
that you transcode any UTF-8 bytes into UTF-16 before the parser sees them.
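
A minimal sketch of that pre-processing step, written in Python rather
than as a Xerces-C input stream, might look like the following. Everything
specific in it is an assumption made for illustration: the element name
"payload", a little-endian UTF-16 document, an element with no attributes,
and content that is plain text with no nested markup. The function name
transcode_embedded_utf8 is hypothetical.

```python
def transcode_embedded_utf8(raw: bytes, tag: str = "payload") -> bytes:
    """Re-encode the UTF-8 content of <tag>...</tag> as UTF-16LE.

    Assumptions (not from the original thread): the surrounding document
    is UTF-16LE, the element has no attributes, and its content is plain
    text. A leading byte order mark, if present, is passed through as-is.
    """
    start = f"<{tag}>".encode("utf-16-le")
    end = f"</{tag}>".encode("utf-16-le")
    out = bytearray()
    pos = 0
    while True:
        # Search at the byte level: odd-length UTF-8 content shifts the
        # 16-bit alignment of everything that follows it.
        i = raw.find(start, pos)
        if i < 0:
            break
        content_begin = i + len(start)
        j = raw.find(end, content_begin)
        if j < 0:
            break
        out += raw[pos:content_begin]
        # The bytes between the tags are UTF-8; convert them to UTF-16LE.
        out += raw[content_begin:j].decode("utf-8").encode("utf-16-le")
        out += end
        pos = j + len(end)
    out += raw[pos:]
    return bytes(out)
```

The result is a consistently encoded byte stream that can be handed to the
parser in place of the original. A byte-pattern search like this can match
in the wrong place on unusual input, so it is a stopgap and not a
substitute for fixing the producer of the document.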

Dave
