RE: UTF-8 encoding question

neilg 3 Dec 2002 16:50:23 -0000

Hi Benson,

In XML 1.0, the point of being able to use character references (such as
&#210;) is so that the Unicode characters to which they correspond can be
referenced even in documents whose encoding doesn't permit the expression
of those Unicode values.  When the parser sees such a reference, it is
indeed supposed to include the character referenced "as if" it had been
what was originally in the document; but that doesn't mean the parser has
to transcode that Unicode value back to the document's encoding, find out
there's no mapping and fail...  Remember, XML processing is always done as
if the document had been presented to the parser in Unicode.
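
To illustrate the point above, here is a minimal sketch (not code from this
thread) that parses a document declared as US-ASCII, an encoding that cannot
express U+00D2 directly, and checks that the reference still arrives as that
character. It assumes the standard JAXP DocumentBuilder API rather than any
Xerces-specific class; the names CharRefDemo and rootText are made up for
the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of Xerces.
public class CharRefDemo {

    // Parses the given bytes with the default JAXP parser and
    // returns the text content of the root element.
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // Every byte of this document is plain ASCII, including the reference.
        String xml = "<?xml version='1.0' encoding='US-ASCII'?>"
                + "<FreeFormText>POSTBOKS 60 SK&#210;YEN</FreeFormText>";
        String text = rootText(xml.getBytes(StandardCharsets.US_ASCII));

        // The parser hands back the Unicode character, not the reference.
        System.out.println(text.equals("POSTBOKS 60 SK\u00D2YEN"));
    }
}
```

Because &#210; is spelled entirely with ASCII bytes, the decoder never sees
a raw 0xD2 byte, so there is nothing for it to reject.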


Hope that helps,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]




From:     "Benson Cheng" <[EMAIL PROTECTED]core.net>
Date:     12/03/2002 11:41 AM
Please respond to xerces-j-user
To:       <[EMAIL PROTECTED]>
cc:
Subject:  RE: UTF-8 encoding question



Thanks for your info.

How about the second question: if I escape the international character
(0xD2) as &#210;, shouldn't Xerces report the same error?  It doesn't.

<FreeFormText>POSTBOKS 60 SKÒYEN</FreeFormText>

<FreeFormText>POSTBOKS 60 SK&#210;YEN</FreeFormText>

Thanks,
Benson.

-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Monday, December 02, 2002 11:49 PM
To: [EMAIL PROTECTED]
Subject: Re: UTF-8 encoding question


Benson Cheng wrote:
> Thanks for the info. Xerces 2.2.1 did report an error
> (java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence)
> on the following line.
>
> <FreeFormText>POSTBOKS 60 SKÒYEN</FreeFormText>

You get this error when you use a non-ASCII character
in your document but the encoding you specify (or the
default) does not match the file's actual encoding.
The first line of the XML document (called the
XMLDecl) specifies the encoding of the file. For
example:

   <?xml version='1.0' encoding='ISO-8859-1'?>

If this line is missing, then the default encoding
is UTF-8. However, if you've created your document
with a text editor like Notepad, it will save the
file with the default encoding of the system --
usually Cp1252 (aka Windows-1252).

However, be aware that simply adding an XMLDecl
line to your file does *not* change the encoding.
To do that, the program that creates the file MUST
save the contents in the appropriate encoding. In
Notepad under Win2K or XP, there is an encoding
selection on the Save dialog that allows you to
select various Unicode encodings like "UTF-8".
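
As a rough illustration of the above (a sketch, not code from this
thread), the following feeds the same element to the parser as bytes in
different encodings, with and without a matching XMLDecl. It assumes the
standard JAXP DocumentBuilder API; EncodingMismatchDemo and tryParse are
hypothetical names. In ISO-8859-1 the character is the single byte 0xD2,
which a UTF-8 decoder reads as the start of a 2-byte sequence; the next
byte, 'Y' (0x59), is not a valid continuation byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of Xerces.
public class EncodingMismatchDemo {
    static final String BODY =
            "<FreeFormText>POSTBOKS 60 SK\u00D2YEN</FreeFormText>";

    // Returns the root element's text, or null if the parser rejects the bytes.
    static String tryParse(byte[] xmlBytes) throws Exception {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xmlBytes))
                    .getDocumentElement().getTextContent();
        } catch (SAXException | IOException e) {
            System.out.println("parse failed: " + e);
            return null;
        }
    }

    public static void main(String[] args) throws Exception {
        // 1. Latin-1 bytes, no XMLDecl: parser assumes UTF-8 and fails.
        System.out.println(tryParse(BODY.getBytes(StandardCharsets.ISO_8859_1)));

        // 2. Latin-1 bytes with a matching XMLDecl: parses.
        String latin1Decl = "<?xml version='1.0' encoding='ISO-8859-1'?>";
        System.out.println(tryParse(
                (latin1Decl + BODY).getBytes(StandardCharsets.ISO_8859_1)));

        // 3. Latin-1 bytes labelled as UTF-8: the label alone changes nothing.
        String utf8Decl = "<?xml version='1.0' encoding='UTF-8'?>";
        System.out.println(tryParse(
                (utf8Decl + BODY).getBytes(StandardCharsets.ISO_8859_1)));

        // 4. Bytes really saved as UTF-8, no XMLDecl: parses.
        System.out.println(tryParse(BODY.getBytes(StandardCharsets.UTF_8)));
    }
}
```

Cases 1 and 3 should fail and cases 2 and 4 should succeed; the exact
exception class and message depend on the parser version in use.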

Hope this helps...

--
Andy Clark * [EMAIL PROTECTED]


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

