Hi John,

Method (1) is completely OK, provided you tell the parser that the document is encoded in UTF-8, which is often the default. In this case:

> <abc>[EMAIL PROTECTED]@^$#</abc>
> (here the element value is raw utf8)

the bytes inside the <abc> tag will be converted to Unicode as if by a call to `new String(bytes, "UTF-8")`, where bytes holds the bytes found inside <abc>.
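A minimal sketch of that conversion, assuming the sample bytes 0xE5 0x9E 0xBE as the element content. The class and method names are made up for illustration, and StandardCharsets.UTF_8 is used in place of the "UTF-8" charset name only to avoid the checked exception. It also shows that an application needing the UTF-8 form can re-encode the String it receives.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of any parser API.
class Utf8DecodeDemo {

    // Decode raw UTF-8 bytes into a Java String (UTF-16 chars internally).
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Re-encode a Java String as UTF-8 bytes.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Assumed sample: the three-byte UTF-8 encoding of one character.
        byte[] bytes = {(byte) 0xE5, (byte) 0x9E, (byte) 0xBE};

        String text = decodeUtf8(bytes);

        // Three bytes decode to a single char, U+57BE.
        System.out.println("chars: " + text.length());
        System.out.println("code point: U+"
                + Integer.toHexString(text.codePointAt(0)).toUpperCase());

        // Encoding the String again yields the original bytes.
        System.out.println("round trip ok: "
                + Arrays.equals(bytes, encodeUtf8(text)));
    }
}
```

So the String handed back by the parser does not lose anything: the UTF-8 bytes can be recovered with getBytes whenever the application needs them.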
Method (2): I didn't really understand what you wanted to do with that, but a character reference (the &#x####; structure) is interpreted as the code of a character in Unicode. So the sequence &#xE5;&#x9E;&#xBE; will be interpreted as three separate Unicode characters. Instead of the one character whose UTF-8 encoding is the bytes 0xE5 0x9E 0xBE, you will get 3 Unicode characters with codes 0xE5, 0x9E and 0xBE.

Take a look at the XML specification:

<spec href="http://www.w3.org/TR/REC-xml#sec-references">
[Definition: A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.]
</spec>

So if you know the Unicode code you may specify it as &#x####;, or you can use "raw" UTF-8 characters, but make sure the parser knows the document is in UTF-8. In any case the characters will be in Unicode by the end of parsing, since Java's "char" and "String" work only with Unicode.

Thanks,
Dmitry

-----Original Message-----
From: Colosi, John [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 12:46
To: 'Voytenko, Dimitry'
Cc: '[EMAIL PROTECTED]'
Subject: RE: Utf8 question

Hi Dimitry,

I'm still a little confused. I understand the reasoning behind the conversion to Unicode for the Java string. But I'm not seeing the same conversion when I input utf-8 using method (2) from below.

Are you saying that my application should not support input using method (1) from below? Should I not allow users to input raw utf-8 into the XML doc?

thanks again,
-- John

-----Original Message-----
From: Voytenko, Dimitry [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 3:28 PM
To: '[EMAIL PROTECTED]'
Subject: RE: Utf8 question

Hi John,

> <abc>[EMAIL PROTECTED]@^$#</abc>
> (here the element value is raw utf8)

According to the DOM interfaces, the values of text nodes are represented by String (org.w3c.dom.Text.getNodeValue() returns String). Since a String is internally an array of char, and Java's char is always Unicode, any characters will be converted to Unicode while building the DOM.
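The DOM behaviour described above can be checked with a short sketch. It assumes the JDK's built-in JAXP parser (javax.xml.parsers) rather than any particular Xerces release, and uses U+57BE (UTF-8 bytes E5 9E BE) as a sample character; the class name DomTextDemo and the helper rootText are hypothetical. It parses the same element once with the raw UTF-8 character and once with three character references.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;

// Hypothetical demo class; assumes the default JAXP DOM parser.
class DomTextDemo {

    // Parse XML bytes and concatenate the text node values of the root element.
    static String rootText(byte[] xmlBytes) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new ByteArrayInputStream(xmlBytes));
            StringBuilder sb = new StringBuilder();
            NodeList children = doc.getDocumentElement().getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child instanceof Text) {
                    sb.append(child.getNodeValue()); // always a Java String
                }
            }
            return sb.toString();
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        String decl = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>";

        // Raw form: the character is stored in the document as bytes E5 9E BE.
        byte[] raw = (decl + "<abc>\u57BE</abc>")
                .getBytes(StandardCharsets.UTF_8);

        // Reference form: one character reference per byte value.
        byte[] refs = (decl + "<abc>&#xE5;&#x9E;&#xBE;</abc>")
                .getBytes(StandardCharsets.UTF_8);

        // Raw form yields one char; reference form yields three chars.
        System.out.println("raw  length: " + rootText(raw).length());
        System.out.println("refs length: " + rootText(refs).length());
    }
}
```

Under these assumptions the raw form comes back as one char (U+57BE) and the reference form as three chars (U+00E5, U+009E, U+00BE), so the two inputs are not equivalent ways of writing the same value.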
According to the SAX interfaces, DocumentHandler.characters, ContentHandler.characters, etc. take an array of chars (char[]) as their first parameter. So your characters will be converted to Unicode there as well. I'm afraid you won't be able to keep the UTF-8 (or any other encoded) bytes, because for that you would need to work with byte[] arrays, which are not supported by any XML interface.

Thanks,
Dmitry

-----Original Message-----
From: Colosi, John [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 05:50
To: '[EMAIL PROTECTED]'
Subject: RE: Utf8 question

Thanks for the response Andy.

I'm writing an application which requires a utf8 value. I think this value can be input in two ways:

1) <abc>[EMAIL PROTECTED]@^$#</abc>
   (here the element value is raw utf8)

or

2) <abc>&#xE5;&#x9E;&#xBE;</abc>
   (here the element value is utf8 written using the hex notation)

In the first example, the parser is modifying the utf-8 and returning to me a Java string containing utf-16. In the second example, the Java string I get is just the exact binary that I entered (because the parser makes no assumption about the binary data).

So how can my application know whether it's looking at utf-8 or utf-16, since it can't really know how the parser handled the input?

Any help is appreciated.

thanks,
-- John

-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 12:32 AM
To: [EMAIL PROTECTED]
Subject: Re: Utf8 question

"Colosi, John" wrote:
> It looks like the Xerces parser is converting incoming UTF-8 to
> UTF-16 automatically during the parse.

Since Java uses UTF16 internally, wouldn't this be what it's supposed to do? Or maybe I'm not understanding what you mean. Please provide some more detailed information.
--
Andy Clark * IBM, TRL - Japan * [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
