Actually, Oracle doesn't "totally support" UTF-8 unless you specifically alter the database to do that. By default, Oracle installs as US7ASCII. Even installed as US7ASCII (x00 - x7f), Oracle will correctly store and deliver 8859-1 (Western European 8) (x00 - xff), because Oracle stores bytes as 8-bit bytes rather than 7-bit bytes. In UTF-8, any character value greater than x7f is represented by two or more bytes.
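To make the "two or more bytes" point concrete, here is a minimal Java sketch. It assumes a JDK new enough to have java.nio.charset.StandardCharsets (Java 7 or later, so not the JDK 1.3.1 discussed in this thread), and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustrative helper (not from the thread): counts the UTF-8 bytes of a string.
class Utf8ByteCount {
    // Number of bytes the given text occupies once encoded as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));       // U+0041, inside x00 - x7f: 1 byte
        System.out.println(utf8Length("\u00E9"));  // U+00E9, an 8859-1 character: 2 bytes
        System.out.println(utf8Length("\u0416"));  // U+0416, a Cyrillic letter: 2 bytes
        System.out.println(utf8Length("\u20AC"));  // U+20AC, the euro sign: 3 bytes
    }
}
```

So a character that fits in one 8859-1 byte, such as U+00E9, already needs two bytes in UTF-8, which is why a column that holds 8859-1 data fine is not automatically holding UTF-8 correctly.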
This is true for Oracle 8x in the United States. I assume it is also true for Oracle 9x. I'd check with the DBA and verify that the Oracle instance has been altered to accept UTF-8.

Thanks,
Donald Holliday

-----Original Message-----
From: Ravi Varanasi [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 04, 2003 12:31 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: UTF-8 problem with Xerces-J2

Hi Jeffrey,

Thanks for the reply. Here are the answers to your questions:

1) The XML doc has its encoding defined as UTF-8. This is the first statement in the XML file:

    <?xml version="1.0" encoding="UTF-8"?>

2) I am using an InputSource with the UTF-8 encoding set. Code snippet:

    InputSource ipSource = new InputSource();
    ipSource.setEncoding("UTF-8");
    ipSource.setByteStream( new FileInputStream( new File(inputFile) ) );
    parser.parse(ipSource);

3) Oracle totally supports UTF-8. I stored some UTF-8 data before (using SQL scripts) and it worked fine.

4) Coming to the important question: do I convert the data to UTF-8? The answer is NO. The Apache documentation says that the encoding is "retained" when a ByteStream is passed in to the parse method (as an InputSource). So the char array I get in the characters callback method must have char data encoded in UTF-8. Is that not correct? Since the parser does not guarantee that the entire char data is sent in a single callback method call, I am constructing a String from the char array, and the String constructor does not take an encoding parameter. Is there any other way I can get a String with UTF-8 encoding? I cannot use a byte array, because if I convert char[] to byte[] there is a good possibility of data loss.

Following is the code in my characters callback method. So, in essence, I am assuming that the char[] I get has UTF-8 data. Please suggest if that is not correct! Please note that elementStack is a data structure I am using to store some data. Please ignore it.
----------------------------------------------------------------------------
public void characters(char ch[], int start, int length) throws SAXException {
    String currData = new String(ch, start, length);
    if (elementStack != null) {
        XMLElement currElement = (XMLElement) elementStack.peek();
        currElement.appendData(currData.trim());
    }
}
----------------------------------------------------------------------------

Thanks for the help,
Ravi Varanasi
408 517 7675

"Jeffrey Rodriguez" <[EMAIL PROTECTED]ail.com>
09/04/2003 11:02 AM
Please respond to xerces-j-user

To: [EMAIL PROTECTED]
cc:
Subject: Re: UTF-8 problem with Xerces-J2

>
>Hi,
> I am trying to parse an UTF-8 encoded document (which has lots of
>UTF-8 characters) using Xerces SAX parser. I am running this program on Sun

So what does your encodingDecl look like, if there is one in your document? There is no problem with Xerces-J parsing UTF-8 data.

>Solaris box with JDK 1.3.1_05. I save the data in XML (after parsing) to

What do you mean by "save the data", and how do you do it? Remember that the parser will give you back the data as Java chars (aka Unicode, UTF-16). Do you transcode the data back into UTF-8, or does Oracle do that?

>Oracle Database (which has UTF-8 encoding ). When I try to display the
>content in a HTML after retrieving from database, I see some weird
>characters. Can any one suggest the reason ?

If you got UTF-16 data back from the parser and stored it into a UTF-8 database without converting it to UTF-8, I think that would be problematic. UTF-8 is a multibyte encoding (one to four bytes per character), while UTF-16 uses 16-bit code units (one or two per character). From U+0000 to U+007F the two agree in value (more correctly: Unicode code points in that range map to the same single-byte value in UTF-8), so values in that range come out correctly even if they are stored directly into a UTF-8 data repository, but data outside this range will not.

>
>
> I am assuming that the UTF-8 format is supported by Xerces. I have

Good assumption, since an XML parser must be able to read both UTF-8 and UTF-16 documents.

> created a InputSource for the XML file and using it as the parameter
>for
> parse method.

Did you use the InputSource and provide an encoding?

> I am using OraclePreparedStatement because the column in which data is
> stored is LONG. Do I need to do anything specific to let Oracle know
> that it is UTF-8 data ?

You said that Oracle stores the data as UTF-8, right? You should ask this question in an Oracle discussion group just to be sure.

> The encoding is specified properly in Jsp using both JSP Page param &
> HTML meta directive.
>

Yes, but first check that your data is correctly stored as UTF-8 in Oracle. To test this, pick a character that takes more than one byte in UTF-8, like the Russian letter Zhe. Its Unicode code point is U+0416, and in UTF-8 it is the two bytes D0 96; that is the value that should be stored in a UTF-8 database. The value that the parser will give you back is the single char U+0416 (why? Because Java chars are Unicode, i.e. UTF-16 code units).

Hope this helps,
Jeffrey Rodriguez
Silicon Valley

>
>Any help is appreciated.
>
>Thanks in advance.
>
>Ravi Varanasi
>408 517 7675
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>For additional commands, e-mail: [EMAIL PROTECTED]
>
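The U+0416 check discussed in this thread can be tried without a database. The following is only a sketch under some assumptions: it uses the SAX parser bundled with a recent JDK (via javax.xml.parsers) rather than a specific Xerces-J2 build, it relies on java.nio.charset.StandardCharsets (Java 7 or later), and the class and method names are invented for illustration. It parses a UTF-8 encoded document, collects the text across all characters() calls, and then transcodes the resulting Java String back to UTF-8 bytes.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of Xerces or of the code posted in this thread.
class ZheRoundTrip {
    // Collects character data no matter how many characters() calls the parser makes.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            // ch holds UTF-16 code units, whatever the encoding of the input file was.
            text.append(ch, start, length);
        }
    }

    // Parses XML bytes and returns all character data as a Java String.
    static String parseText(byte[] xmlBytes) {
        try {
            TextCollector handler = new TextCollector();
            InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
            return handler.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XML parse failed", e);
        }
    }

    // Formats bytes as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><doc>\u0416</doc>";
        byte[] onDisk = xml.getBytes(StandardCharsets.UTF_8);  // what the file would contain

        String parsed = parseText(onDisk);
        System.out.println(Integer.toHexString(parsed.charAt(0)));         // prints 416
        System.out.println(hex(parsed.getBytes(StandardCharsets.UTF_8)));  // prints D096
    }
}
```

The char handed to characters() is the code point U+0416, and the UTF-8 bytes D0 96 only appear once the String is explicitly encoded. Whether the JDBC driver performs that encoding when the value is bound to a statement depends on the driver and the database character set, which this sketch does not cover.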
