Re: UTF-8 problem with Xerces-J2

Jeffrey Rodriguez 4 Sep 2003 18:04:21 -0000


Hi,
      I am trying to parse a UTF-8 encoded document (which has lots of
UTF-8 characters) using the Xerces SAX parser. I am running this program on Sun


So what does the encodingDecl in your document look like, if it has one?

Xerces-J has no problem parsing UTF-8 data.

Solaris box with JDK 1.3.1_05. I save the data in XML (after parsing) to

What do you mean by "save the data", and how do you do it? Remember that the parser gives the data back to you as Java chars (that is, Unicode in UTF-16). Do you transcode the data back into UTF-8, or does Oracle do that?
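
To make the transcoding step concrete, here is a minimal sketch of going from the parser's Java String to UTF-8 bytes and back. The class and method names are made up for illustration, and it assumes Java 7 or later for StandardCharsets (on an older JDK you would pass the charset name "UTF-8" to getBytes instead). Whether you need to do this yourself or the JDBC driver does it for you depends on your setup.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Xerces or Oracle.
final class TranscodeSketch {
    // Encode a parsed string (UTF-16 in memory) as UTF-8 bytes for storage.
    static byte[] toUtf8(String parsed) {
        return parsed.getBytes(StandardCharsets.UTF_8);
    }

    // Decode UTF-8 bytes read back from storage into a Java String.
    static String fromUtf8(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String parsed = "caf\u00E9"; // 4 chars; the last one is outside ASCII
        byte[] stored = toUtf8(parsed);
        // The non-ASCII char takes two bytes in UTF-8, so the counts differ.
        System.out.println(parsed.length() + " chars, " + stored.length + " bytes");
        System.out.println(fromUtf8(stored).equals(parsed));
    }
}
```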

Oracle Database (which has UTF-8 encoding). When I try to display the
content in an HTML page after retrieving it from the database, I see some weird
characters. Can anyone suggest the reason?

If you got UTF-16 data back from the parser and stored it in a UTF-8 database without converting it to UTF-8, I think that would be problematic. UTF-8 is a multibyte encoding (one to four bytes per character), while UTF-16 uses 16-bit code units (one or two per character). The two agree only from U+0000 to U+007F (more correctly, Unicode code points in that range map to single UTF-8 bytes with the same value), so some values will come out right if they are stored directly into a UTF-8 data repository, but data outside this range will not.
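
A small sketch of that range behaviour, in case it helps. The names are hypothetical, Java 7 or later is assumed, and ISO-8859-1 is only my stand-in for "some wrong charset"; I don't know which one is actually involved in your case.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class AsciiRangeSketch {
    // True when the UTF-8 bytes of s equal its char values one-for-one,
    // which only holds when every char is in U+0000..U+007F.
    static boolean sameAsUtf8Bytes(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        if (utf8.length != s.length()) {
            return false;
        }
        for (int i = 0; i < utf8.length; i++) {
            if ((utf8[i] & 0xFF) != s.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Simulates a wrong-charset read: UTF-8 bytes decoded as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(sameAsUtf8Bytes("plain ASCII")); // inside the safe range
        System.out.println(sameAsUtf8Bytes("caf\u00E9"));   // one char outside it
        System.out.println(misdecode("caf\u00E9").length()); // one char became two
    }
}
```

The last line is the kind of thing that shows up as "weird characters" on a page: each multibyte character turns into two or more unrelated ones.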

I am assuming that the UTF-8 format is supported by Xerces. I have

Good assumption, since every XML parser must be able to read both UTF-8 and UTF-16 documents.

created an InputSource for the XML file and am using it as the parameter for the parse method.


Did you set an encoding on the InputSource?
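
For reference, a minimal sketch of what I mean, using the standard JAXP and SAX classes. The class and method names are mine, the document is an in-memory stand-in for your file, and I am assuming a byte stream (if you hand the InputSource a character stream, the encoding setting is not what decodes the data).

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class for illustration only.
final class InputSourceEncodingSketch {
    // Parses the bytes as UTF-8 and returns all character data found.
    static String parseText(byte[] xmlBytes) throws Exception {
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        source.setEncoding("UTF-8"); // tell the parser how to decode the byte stream

        final StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser hands back Java chars (UTF-16), not UTF-8 bytes.
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc>\u00FC</doc>"
                .getBytes(StandardCharsets.UTF_8);
        String text = parseText(xml);
        // Prints the code unit the handler received for the one non-ASCII char.
        System.out.println(Integer.toHexString(text.charAt(0)));
    }
}
```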

   I am using OraclePreparedStatement because the column in which data is
   stored is LONG. Do I need to do anything specific to let Oracle know
   that it is UTF-8 data ?

You said that Oracle stores data as UTF-8, right? You should ask this question in an Oracle discussion group just to be sure.

   The encoding is specified properly in Jsp using both JSP Page param &
   HTML meta directive.

Yes, but first check that your data is correctly stored as UTF-8 in Oracle. To test this, pick a character that takes more than one byte in UTF-8.

Take the Russian letter Zhe (U+0416): in UTF-8 I think it is the two bytes D0 96, and that is the value that should be stored in the UTF-8 database. The value that the parser will give you back is the Java char \u0416 (why? Because Java chars are Unicode, as UTF-16 code units...).
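
If you want to see for yourself which bytes the letter U+0416 should occupy in a UTF-8 column, something like this sketch prints them so you can compare against a dump of the stored value. The class name is made up, and it assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration only.
final class ZheCheckSketch {
    // Returns the UTF-8 encoding of s as upper-case hex, two digits per byte.
    static String hexOfUtf8(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String zhe = "\u0416"; // one Java char, code point U+0416
        System.out.println(hexOfUtf8(zhe)); // prints D096
    }
}
```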

Hope this helps,

                             Jeffrey Rodriguez
                             Silicon Valley


Any help is appreciated.

Thanks in advance.

Ravi Varanasi
408 517 7675


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



