Hi Benson,

In XML 1.0, the point of character references (such as &#xD2;) is that the Unicode characters they denote can be referenced even in documents whose encoding cannot represent those characters directly. When the parser sees such a reference, it is indeed supposed to include the referenced character "as if" it had been in the document originally; but that doesn't mean the parser has to transcode that Unicode value back to the document's encoding, find that there's no mapping, and fail. Remember, XML processing is always done as if the document had been presented to the parser in Unicode.
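To make the point about character references concrete, here is a minimal sketch. It assumes the JAXP API bundled with the JDK (javax.xml.parsers) rather than any Xerces-specific class, and the class and method names (CharRefDemo, parseEscaped) are made up for illustration. The document is declared and encoded as US-ASCII, which cannot represent U+00D2 directly, yet the reference still yields that character in the parsed text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class CharRefDemo {
    // Parses a US-ASCII document whose only non-ASCII character is
    // written as a character reference, and returns the element text.
    static String parseEscaped() throws Exception {
        String xml = "<?xml version='1.0' encoding='US-ASCII'?>"
                + "<FreeFormText>POSTBOKS 60 SK&#xD2;YEN</FreeFormText>";
        // Every byte here is plain ASCII; the reference is just the
        // eight characters '&', '#', 'x', 'D', '2', ';' and so on.
        byte[] bytes = xml.getBytes(StandardCharsets.US_ASCII);
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = parseEscaped();
        // The parser hands back the real character U+00D2, not the reference.
        System.out.println(text.equals("POSTBOKS 60 SK\u00D2YEN"));
        System.out.println(Integer.toHexString(text.charAt(text.indexOf('\u00D2'))));
    }
}
```

The parser never needs to map U+00D2 back into US-ASCII, which is why no encoding error is raised for the escaped form.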
Hope that helps,
Neil

Neil Graham
XML Parser Development
IBM Toronto Lab
Phone: 905-413-3519, T/L 969-3519
E-mail: [EMAIL PROTECTED]

From: "Benson Cheng" <[EMAIL PROTECTED]core.net>
Date: 12/03/2002 11:41 AM
Please respond to: xerces-j-user
To: <[EMAIL PROTECTED]>
cc:
Subject: RE: UTF-8 encoding question

Thanks for your info. How about the second question: if I escape the international character (0xD2) as &#xD2;, shouldn't Xerces report the same error? It doesn't.

<FreeFormText>POSTBOKS 60 SKÒYEN</FreeFormText>
<FreeFormText>POSTBOKS 60 SK&#xD2;YEN</FreeFormText>

Thanks,
Benson.

-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Monday, December 02, 2002 11:49 PM
To: [EMAIL PROTECTED]
Subject: Re: UTF-8 encoding question

Benson Cheng wrote:
> Thanks for the info, the xerces 2.2.1 did report error (java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence) on the following line.
>
> <FreeFormText>POSTBOKS 60 SKÒYEN</FreeFormText>

You get this error when your document contains a non-ASCII character but the declared (or default) encoding does not match the encoding the file was actually saved in. The first line of the XML document (called the XMLDecl) specifies the encoding of the file. For example:

<?xml version='1.0' encoding='ISO-8859-1'?>

If this line is missing, then the default encoding is UTF-8. However, if you've created your document with a text editor like Notepad, it will save the file with the default encoding of the system -- usually Cp1252 (aka Windows-1252).
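The mismatch described above can be reproduced with a short sketch. It assumes the JAXP parser bundled with the JDK rather than Xerces 2.2.1 itself, so the exact exception class and message may differ from the one quoted; the class and method names (EncodingMismatchDemo, parseLatin1Bytes, failsToParse) are hypothetical. The text is encoded as ISO-8859-1, where the accented O is the single byte 0xD2, and parsed once without and once with a matching XMLDecl.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

public class EncodingMismatchDemo {
    static final String BODY =
            "<FreeFormText>POSTBOKS 60 SK\u00D2YEN</FreeFormText>";

    // Encodes the text as ISO-8859-1 bytes, parses them, and returns
    // the text content of the root element.
    static String parseLatin1Bytes(String xml) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes))
                .getDocumentElement()
                .getTextContent();
    }

    // Returns true if the parser rejects the ISO-8859-1 bytes.
    static boolean failsToParse(String xml) throws Exception {
        try {
            parseLatin1Bytes(xml);
            return false;
        } catch (IOException | SAXException e) {
            System.out.println("Parse failed: " + e.getMessage());
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        // No XMLDecl: the parser assumes UTF-8. The byte 0xD2 looks like
        // the start of a 2-byte sequence, but 'Y' is not a valid second byte.
        System.out.println(failsToParse(BODY));

        // XMLDecl matches the bytes actually written, so parsing succeeds.
        String declared = "<?xml version='1.0' encoding='ISO-8859-1'?>" + BODY;
        System.out.println(parseLatin1Bytes(declared));
    }
}
```

In this sketch the declaration and the bytes agree only because the same program controls both; the declaration by itself does nothing to the bytes.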
However, be aware that simply adding an XMLDecl line to your file does *not* change the encoding. To do that, the program that creates the file MUST save the contents in the appropriate encoding. In Notepad under Win2K or XP, there is an encoding selection on the Save dialog that allows you to select various Unicode encodings like "UTF-8".

Hope this helps...

-- 
Andy Clark * [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]