I have an XML document (UTF-8 encoded) containing German characters (e.g. ä).
 
In the example below the umlaut character 'ä' is UTF-8 encoded.
The XML file looks like this:
 
<?xml version="1.0" encoding="UTF-8"?>
<!-- This came from sample poll servlet -->
<!DOCTYPE X >
<X Attrib1="Attrib1Info" Attrib2="Attrib2Data" Attrib3="Attrib3Info" >
<CD>
<C attrib1="testattrib1" attrib2="testattrib2" >UTF Encoded Umlaut character
ä</C>
</CD>
</X>
 
 
When I parse the document with Xerces (SAX), the parser does not
return the character ä in the characters(char ch[], int start, int length)
callback method.
What I expect to receive in the characters array is "UTF Encoded Umlaut
character ä", either in one long chunk or in several chunks. Instead I get
the chars exactly as they appear in the XML doc:
"UTF Encoded Umlaut character ä".
 
Why is the parser not able to return the correct Unicode characters when
all parsers are supposed to support UTF-8 encoding?
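
For reference, the two characters "ä" are exactly what the UTF-8 bytes of "ä" (0xC3 0xA4) look like when they are decoded as ISO-8859-1. The Java sketch below only illustrates that effect. The names UmlautBytes, misdecode and roundTrip are made up for illustration, and it says nothing about where in the parse/print pipeline such a decoding step might happen.

```java
import java.nio.charset.StandardCharsets;

public class UmlautBytes {
    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Encodes the text as UTF-8 and decodes it as UTF-8 again.
    static String roundTrip(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String umlaut = "\u00e4"; // a-umlaut, written as an escape to avoid source-encoding issues
        String wrong = misdecode(umlaut);
        // Print code points in hex so the result does not depend on the console encoding.
        for (char c : wrong.toCharArray()) {
            System.out.println(Integer.toHexString(c));
        }
        System.out.println(roundTrip(umlaut).equals(umlaut));
    }
}
```

If the string arriving in characters() really contains U+00C3 followed by U+00A4, that would suggest the bytes were decoded with the wrong charset somewhere; if it contains U+00E4 and only looks wrong when printed, the output encoding would be the more likely suspect.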
 
 
If, instead of the UTF-8 bytes for ä, I use the character reference &#228;,
the parser recognizes it and returns the correct string in two chunks:
char array chunk 1: UTF Encoded Umlaut character 
char array chunk 2: ä
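
Since characters() may deliver the text in several chunks, a handler normally has to concatenate them before comparing. Below is a rough sketch of that, assuming the JAXP SAX parser bundled with the JDK rather than a standalone Xerces build. ChunkCollector and its methods are hypothetical names, and the sample XML is a cut-down version of the document above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class ChunkCollector extends DefaultHandler {
    // Every characters() callback is stored as one chunk.
    final List<String> chunks = new ArrayList<>();

    @Override
    public void characters(char[] ch, int start, int length) {
        chunks.add(new String(ch, start, length));
    }

    // Joins all chunks into the complete character data.
    String text() {
        return String.join("", chunks);
    }

    // Parses the XML from UTF-8 bytes, so the parser does the decoding itself.
    static ChunkCollector parse(String xml) throws Exception {
        ChunkCollector collector = new ChunkCollector();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(bytes), collector);
        return collector;
    }

    public static void main(String[] args) throws Exception {
        String header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";
        String raw = header + "<C>UTF Encoded Umlaut character \u00e4</C>";
        String ref = header + "<C>UTF Encoded Umlaut character &#228;</C>";
        ChunkCollector a = parse(raw);
        ChunkCollector b = parse(ref);
        // The number of chunks is up to the parser; only the joined text is compared.
        System.out.println(a.chunks.size() + " chunk(s) vs " + b.chunks.size() + " chunk(s)");
        System.out.println(a.text().equals(b.text()));
    }
}
```

Feeding the parser bytes (an InputStream) rather than an already-decoded String or Reader leaves the charset handling to the parser, which uses the encoding declaration in the XML header.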
 
When I used IE or other XML viewers to view the file, they correctly
interpreted the UTF-8 encoding and displayed the XML with German characters.
 
Is there a bug in Xerces SAX or am I missing something?
 
Thanks
Ashish
 
