Hi, I am trying to parse a UTF-8 encoded document (which has lots of non-ASCII characters) using the Xerces SAX parser. I am running this program on a Sun
So what does the encoding declaration (encodingDecl) in your document look like, if there is one?
Xerces-J has no problem parsing UTF-8 data.
Solaris box with JDK 1.3.1_05. I save the data in XML (after parsing) to
What do you mean by "save the data", and how do you do it? Remember that the parser hands
the data back to you as Java chars (Unicode, in UTF-16). Do you transcode the data back into UTF-8
yourself, or does Oracle do that?
Oracle Database (which uses UTF-8 encoding). When I try to display the content in an HTML page after retrieving it from the database, I see some weird characters. Can anyone suggest the reason?
If you took the UTF-16 data you got back from the parser and stored it in a UTF-8 column
without converting it to UTF-8, I think that would be a problem.
UTF-8 is a multibyte encoding (one or more bytes per character), while UTF-16 uses 16-bit
code units. For code points U+0000 to U+007F the UTF-8 byte has the same value as the
Unicode code point, so values in that range still come out right if they are stored directly
into a UTF-8 data repository, but data outside this range will not.
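To make the range argument concrete, here is a minimal sketch that prints the UTF-8 and UTF-16 (big-endian) bytes of one ASCII and one non-ASCII character. The class and method names (`EncodingCompare`, `hex`, `utf8Hex`, `utf16Hex`) are made up for illustration, and it uses `java.nio.charset.StandardCharsets`, which needs a newer JDK than the 1.3.1 mentioned in the question.

```java
import java.nio.charset.StandardCharsets;

public class EncodingCompare {
    // Format a byte array as space-separated uppercase hex, e.g. "D0 96".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Bytes of the string when encoded as UTF-8.
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    // Bytes of the string when encoded as UTF-16, big-endian, no byte order mark.
    static String utf16Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        // ASCII: the UTF-8 byte equals the code point value.
        System.out.println("A      UTF-8: " + utf8Hex("A") + "   UTF-16: " + utf16Hex("A"));
        // Non-ASCII (U+0416): the two encodings produce different bytes.
        System.out.println("U+0416 UTF-8: " + utf8Hex("\u0416") + "   UTF-16: " + utf16Hex("\u0416"));
    }
}
```

For "A" the UTF-8 byte is 41 and the UTF-16 bytes are 00 41, so the value survives. For U+0416 the UTF-8 bytes are D0 96 while the UTF-16 bytes are 04 16, so writing one where the other is expected corrupts the character.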
I am assuming that the UTF-8 format is supported by Xerces. I have
Good assumption, since every XML parser must be able to read both UTF-8 and UTF-16 documents.
created an InputSource for the XML file and am using it as the parameter for
the parse method.
Did you set an encoding on the InputSource?
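As a rough illustration of what setting the encoding on the InputSource could look like, here is a small sketch using the standard SAX and JAXP classes. The class name `ParseUtf8`, the helper `parseText` and the in-memory sample document are assumptions made up for the example; the original program reads a file and its code is not shown in this thread.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ParseUtf8 {
    // Parse the given bytes as UTF-8 XML and return all character data found.
    static String parseText(byte[] xmlBytes) throws Exception {
        // Hand the parser a byte stream (not a Reader) and state the encoding.
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        source.setEncoding("UTF-8");

        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // The parser delivers Java chars (UTF-16), not UTF-8 bytes.
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc>\u0416</doc>";
        String text = parseText(xml.getBytes(StandardCharsets.UTF_8));
        // Print the code point of the first char handed back by the parser.
        System.out.println(Integer.toHexString(text.charAt(0)));
    }
}
```

The point of the sketch is that what comes out of `characters()` is already decoded, so whatever writes it to the database has to encode it again.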
I am using OraclePreparedStatement because the column in which the data is stored is LONG. Do I need to do anything specific to let Oracle know that it is UTF-8 data?
You said that Oracle stores the data as UTF-8, right? You should ask this question in an Oracle
discussion group just to be sure.
The encoding is specified properly in the JSP, using both the JSP page directive and the HTML meta tag.
Yes, first try to check that your data is correctly stored as UTF-8 in Oracle. To test this, pick
a character that takes more than one byte in UTF-8.
For example the Russian letter Zhe, U+0416: in UTF-8 I think it is the two bytes D0 96, and that should be the value stored in the UTF-8 column.
The value that the parser will give you back is the char \u0416 (why? Because Java chars are Unicode...).
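One way to run this kind of check, sketched under the assumption that the test character is U+0416 and that you can get at the raw stored bytes somehow. The byte arrays below are hard-coded stand-ins, not real database output, and `StoredBytesCheck` and its methods are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

public class StoredBytesCheck {
    // True if the stored bytes, decoded as UTF-8, give back the expected text.
    static boolean isExpectedUtf8(byte[] stored, String expected) {
        return new String(stored, StandardCharsets.UTF_8).equals(expected);
    }

    // What a page shows if correct UTF-8 bytes are decoded as ISO-8859-1 instead.
    static String decodeAsLatin1(byte[] stored) {
        return new String(stored, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from the database: correct UTF-8 for U+0416.
        byte[] good = {(byte) 0xD0, (byte) 0x96};
        // Stand-in for a broken store: the UTF-16 code unit written as raw bytes.
        byte[] bad = {0x04, 0x16};

        System.out.println(isExpectedUtf8(good, "\u0416")); // true
        System.out.println(isExpectedUtf8(bad, "\u0416"));  // false
        // Correct bytes, wrong decoder: one character turns into two.
        System.out.println(decodeAsLatin1(good).length());  // 2
    }
}
```

If the stored bytes pass this check but the page still looks wrong, the problem is more likely on the display side than in the parser or the database.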
Hope this helps,
Jeffrey Rodriguez
Silicon Valley
Any help is appreciated.
Thanks in advance.
Ravi Varanasi 408 517 7675
