[jira] Resolved: (XERCESJ-957) Encoding problem; parsed DOM contains incorrect UTF-16 characters

jira Sat, 08 May 2004 10:00:09 -0700

Message:

   The following issue has been resolved as WON'T FIX.


   Resolver: Michael Glavassevich
       Date: Sat, 8 May 2004 10:00 AM

The byte 0x93 in ISO-8859-1 maps to code point U+0093 (set transmit state). I'm 
guessing you meant to label your document Windows-1252, which differs from ISO-8859-1 
[1]. Windows-1252 maps byte 0x93 to the left double quote character. If this is what 
you meant, you need to change your document's encoding declaration.
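
The difference between the two encodings can be checked outside the parser. The 
following is a minimal sketch, not part of the original report: it decodes the single 
byte 0x93 with the JDK's own charset decoders. The class name Byte93Demo and the 
helper decode are hypothetical, and the sketch assumes a JDK that ships the 
windows-1252 charset (standard desktop and server JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes one byte under two different charsets.
final class Byte93Demo {

    // Decode a single byte with the given charset and return the resulting code point.
    static int decode(byte b, Charset cs) {
        return new String(new byte[] { b }, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x93;

        // ISO-8859-1 maps every byte to the code point with the same value.
        System.out.printf("ISO-8859-1:   U+%04X%n",
                decode(b, StandardCharsets.ISO_8859_1));      // prints U+0093

        // Windows-1252 reassigns 0x80-0x9F to printable characters.
        System.out.printf("windows-1252: U+%04X%n",
                decode(b, Charset.forName("windows-1252")));  // prints U+201C
    }
}
```

So a document whose bytes were written as Windows-1252 but which is declared as 
ISO-8859-1 will yield U+0093 for byte 0x93, exactly as the declaration requests.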

As for character references, their replacement text is the character itself [2][3]. 
They are a syntactic device for including any of the legal XML characters in the 
document and have nothing to do with the document's encoding. In fact, they are 
particularly useful for including code points in a document which may be out of band 
in a given character encoding.
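
As a rough illustration of this point, the sketch below parses a small document 
declared as ISO-8859-1 and prints the code points that the character references 
produce. It is an assumption-laden example rather than the reporter's test case: it 
uses the JAXP DocumentBuilder bundled with the JDK instead of the Xerces2-J 2.6.2 
DOMParser, and the class name CharRefDemo and helper parseText are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class: shows that character references name code points directly.
final class CharRefDemo {

    // Encode the XML string with the given charset, parse it, and return the
    // text content of the document element.
    static String parseText(String xml, Charset cs) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(cs)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Only ASCII bytes appear in the source; the references carry the code points.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<Text>&#x93;B&#x201D;</Text>";
        String text = parseText(xml, StandardCharsets.ISO_8859_1);

        // &#x93; always means U+0093, whatever the document's encoding is.
        System.out.printf("U+%04X%n", (int) text.charAt(0));  // prints U+0093
        // &#x201D; is available even though ISO-8859-1 cannot encode it.
        System.out.printf("U+%04X%n", (int) text.charAt(2));  // prints U+201D
    }
}
```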

[1] http://en.wikipedia.org/wiki/ISO_8859-1
[2] http://www.w3.org/TR/2004/REC-xml-20040204/#NT-CharRef
[3] http://www.w3.org/TR/2004/REC-xml-20040204/#entproc
---------------------------------------------------------------------
View the issue:
  http://issues.apache.org/jira/browse/XERCESJ-957

Here is an overview of the issue:
---------------------------------------------------------------------
        Key: XERCESJ-957
    Summary: Encoding problem; parsed DOM contains incorrect UTF-16 characters
       Type: Bug

     Status: Resolved
   Priority: Major
 Resolution: WON'T FIX

    Project: Xerces2-J
   Versions:
             2.6.2

   Assignee: 
   Reporter: F. Andy Seidl

    Created: Sat, 8 May 2004 8:30 AM
    Updated: Sat, 8 May 2004 10:00 AM
Environment: JDK 1.4.2_03 on both Windows XP and Linux

Description:
After parsing a source XML document that uses an encoding other than UTF-8, the 
resulting DOM incorrectly contains non-UTF-16 characters from the original source XML 
document.  The results of parsing the following document into a DOM suggest that the 
DOMParser is not translating characters from the source encoding to UTF-16.

<?xml version="1.0" encoding="ISO-8859-1"?>
<Example>
        <Text>"A"</Text>
        <Text>&#x93;B&#x94;</Text>
        <Text>&#x201C;C&#x201D;</Text>
</Example>

All the resulting DOM strings contain the same character values as the source 
document. For example, the first Text string begins with the character value 0x93, 
which is the left double quote character in the ISO-8859-1 character set.  In UTF-16, 
0x93 is a "set transmit state" control character.

The third Text element begins with the character 0x201C, which is the UTF-16 left double 
quote character, but which is not even a valid ISO-8859-1 character.  The fact 
that this character is transferred unchanged to the DOM further suggests that no 
translation from the source character set to UTF-16 is being performed.


---------------------------------------------------------------------
JIRA INFORMATION:
This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa

If you want more information on JIRA, or have a bug to report see:
   http://www.atlassian.com/software/jira


