RE: Utf8 question

Voytenko, Dimitry 8 Nov 2001 22:31:35 -0000

Hi John,

Method (1) is completely OK if you declare that the document is encoded in
UTF-8, which is often used by default. But in this case
> <abc>[EMAIL PROTECTED]@^$#</abc>
>    (here the element value is raw utf8)
the bytes inside the <abc> tag will be converted to Unicode, as if by a call
to `new String(bytes, "UTF-8")`, where bytes holds the bytes found inside
<abc>.
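
As a minimal sketch of that conversion (not the parser's actual code): the
class name Utf8DecodeSketch and the helper decode are made up for
illustration, the byte values are the ones from method (2), and
StandardCharsets.UTF_8 stands in for the "UTF-8" string argument.

```java
import java.nio.charset.StandardCharsets;

class Utf8DecodeSketch {
    // Hypothetical helper: decodes raw UTF-8 bytes into a Java String,
    // roughly what happens to the bytes found inside <abc>.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Three bytes that together encode a single character in UTF-8.
        byte[] bytes = {(byte) 0xE5, (byte) 0x9E, (byte) 0xBE};
        String s = decode(bytes);
        System.out.println(s.length());                        // 1
        System.out.println(Integer.toHexString(s.charAt(0)));  // 57be
    }
}
```

So three UTF-8 bytes come out as one Java char, U+57BE.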

Method (2): I didn't really understand what you wanted to do with that, but
a &#x...; character reference is interpreted as the Unicode code of a
character, not as a byte. So the sequence &#xe5;&#x9e;&#xbe; will be
interpreted as three Unicode characters: instead of the one character whose
UTF-8 encoding is the bytes E5 9E BE (U+57BE), you will get 3 Unicode
characters with codes 0xE5, 0x9E and 0xBE. Take a look at the XML
specification:
<spec href="http://www.w3.org/TR/REC-xml#sec-references">
[Definition: A character reference refers to a specific character in the
ISO/IEC 10646 character set, for example one not directly accessible from
available input devices.]
</spec>

So if you know the Unicode code you may specify it as &#x####; or you can
use "raw" UTF-8 characters, but make sure the parser knows the document is
in UTF-8. In any case the characters will be in Unicode by the end of
parsing, since Java's "char" and "String" work only with Unicode.
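
To see the difference between the two methods, here is a rough sketch using
the standard JAXP DOM parser (whatever implementation the JDK provides, not
necessarily Xerces). The class name CharRefVsRawUtf8 and the helper parseAbc
are invented for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class CharRefVsRawUtf8 {
    // Hypothetical helper: wraps the content in <abc>, serializes the
    // document as UTF-8 bytes, parses it, and returns the element text.
    static String parseAbc(String content) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<abc>" + content + "</abc>";
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Method (1): the character is stored as raw UTF-8 bytes E5 9E BE.
        System.out.println(parseAbc("\u57BE").length());                 // 1
        // Method (2): each reference is a separate Unicode character.
        System.out.println(parseAbc("&#xe5;&#x9e;&#xbe;").length());     // 3
        // One reference that names the Unicode code itself.
        System.out.println(parseAbc("&#x57BE;").equals(parseAbc("\u57BE"))); // true
    }
}
```

The raw UTF-8 form and &#x57BE; give the same one-character String, while
the three byte-valued references give three characters.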

Thanks,
Dmitry

-----Original Message-----
From: Colosi, John [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 12:46
To: 'Voytenko, Dimitry'
Cc: '[EMAIL PROTECTED]'
Subject: RE: Utf8 question

Hi Dimitry,

I'm still a little confused.  I understand the reasoning behind the
conversion to Unicode for the Java string.  But I'm not seeing the same
conversion when I input utf-8 using method (2) from below.  Are you saying
that my application should not support input using method (1) from below?
Should I not allow users to input Raw utf-8 into the XML doc?

thanks again,
-- John

-----Original Message-----
From: Voytenko, Dimitry [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 3:28 PM
To: '[EMAIL PROTECTED]'
Subject: RE: Utf8 question

Hi John,

> <abc>[EMAIL PROTECTED]@^$#</abc>
>    (here the element value is raw utf8)

According to the DOM interfaces, values of text nodes are represented by
String (org.w3c.dom.Text.getNodeValue() returns String). Since String is
internally an array of char and Java's char is always Unicode, any
characters will be converted to Unicode while building the DOM.
According to the SAX interfaces, DocumentHandler.characters,
ContentHandler.characters, etc. take an array of chars (char[]) as the first
parameter. So your characters will be converted to Unicode again.
So I'm afraid you won't be able to keep UTF-8 or other encoded characters,
because in that case you'd need to operate on byte[] arrays, which are not
supported by any XML interface.
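
If the application really needs UTF-8 bytes, one option is to re-encode the
characters after parsing. The following is only a sketch of that idea using
the JAXP SAX parser; the class name SaxTextToUtf8 and the method textAsUtf8
are hypothetical, and it simply collects all character data in the document.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

class SaxTextToUtf8 {
    // Hypothetical helper: parses the document, gathers the char[] chunks
    // SAX delivers, and re-encodes the collected text as UTF-8 bytes.
    static byte[] textAsUtf8(byte[] xml) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml),
                    new DefaultHandler() {
                        @Override
                        public void characters(char[] ch, int start, int length) {
                            // The parser hands over Unicode chars, never raw bytes.
                            text.append(ch, start, length);
                        }
                    });
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return text.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] xml = "<abc>\u57BE</abc>".getBytes(StandardCharsets.UTF_8);
        for (byte b : textAsUtf8(xml)) {
            System.out.printf("%02x ", b & 0xFF);  // e5 9e be
        }
        System.out.println();
    }
}
```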

Thanks,
Dmitry

-----Original Message-----
From: Colosi, John [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 05:50
To: '[EMAIL PROTECTED]'
Subject: RE: Utf8 question

Thanks for the response Andy.
I'm writing an application which requires a utf8 value.  I think this value
can be input in two ways:

1)

<abc>[EMAIL PROTECTED]@^$#</abc>
   (here the element value is raw utf8)

or

2)

<abc>&#xe5;&#x9e;&#xbe;</abc>
   (here the element value is utf8 written using the hex notation)

In the first example, the parser is modifying the utf-8 and returning to me
a Java string containing utf-16.  In the second example, the Java string I
get is just the exact binary that I entered (because the parser makes no
assumption about the binary data).

So how can my application know whether it's looking at utf-8 or utf-16
because it can't really know how the parser handled the input?

Any help is appreciated.

thanks,
-- John

-----Original Message-----
From: Andy Clark [mailto:[EMAIL PROTECTED]
Sent: Thursday, November 08, 2001 12:32 AM
To: [EMAIL PROTECTED]
Subject: Re: Utf8 question

"Colosi, John" wrote:
>         It looks like the Xerces parser is converting incoming UTF-8 to
> UTF-16 automatically during the parse.

Since Java uses UTF16 internally, wouldn't this be what
it's supposed to do? Or maybe I'm not understanding what
you mean. Please provide some more detailed information.

-- 
Andy Clark * IBM, TRL - Japan * [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
