RE: utf-8 characters problem

Christopher Ebert 1 Mar 2005 01:10:47 -0000

The 'square' character probably appears because the display character
set for your JEditPanel doesn't have a character for Unicode U+2062. The
character itself is there, since you can see it when you explicitly
encode to ISO-8859-1. U+002B being encoded as '+' is correct: it will
only be encoded as an escape if you choose a character set that does not
include the standard '+' symbol (I don't think there is one that encodes
the rest of the ASCII characters and not '+'). The output serialization
is what I'd check first: somewhere the JEditPanel will have an encoding
set for it (probably the default), and it's likely to be something like
Cp1252 if you're on a Windows box.
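
A minimal sketch of how one might check which character sets can
represent the two characters directly. The class name CharsetCoverage
and the list of charset names are assumptions made for illustration;
only java.nio.charset calls from the standard library are used.

```java
import java.nio.charset.Charset;

class CharsetCoverage {
    /** True if the named charset is available and can encode the character directly. */
    static boolean canEncode(String charsetName, char c) {
        return Charset.isSupported(charsetName)
                && Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        // The platform default is what most components fall back to.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Hypothetical selection of charsets to compare.
        String[] names = {"US-ASCII", "ISO-8859-1", "windows-1252", "UTF-8"};
        char[] chars = {'+', '\u2062'}; // PLUS SIGN and INVISIBLE TIMES
        for (String name : names) {
            for (char c : chars) {
                System.out.printf("%-12s U+%04X encodable: %b%n",
                        name, (int) c, canEncode(name, c));
            }
        }
    }
}
```

A charset that reports false for U+2062 can only carry that character in
XML as a character reference.
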
The escapes and the characters themselves should be viewed as completely
equivalent in an encoding that supports the character. Having an
application that depends on one or the other is a sign of a bad design,
or at least one that will give you a great deal of trouble.
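
The equivalence can be seen with a small round trip. This is only a
sketch under stated assumptions: it uses the JDK's built-in JAXP parser
and a plain Transformer rather than the Xerces LSParser/LSSerializer
discussed in this thread, and the class name CharRefRoundTrip is made up
for the example. Whether the serializer writes the reference in
hexadecimal or decimal form depends on the implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.Charset;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class CharRefRoundTrip {
    /** Parses an XML string into a DOM tree with the JDK's default parser. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    /** Returns the text content of the document element. */
    static String textOf(String xml) {
        return parse(xml).getDocumentElement().getTextContent();
    }

    /** Serializes the tree to bytes in the given encoding and decodes them again. */
    static String serialize(Document doc, String encoding) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return new String(out.toByteArray(), Charset.forName(encoding));
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<mo>&#x2062;</mo>";

        // After parsing, the reference is just one character in the DOM.
        String text = textOf(xml);
        System.out.printf("parsed: %d char, U+%04X%n", text.length(), (int) text.charAt(0));

        // US-ASCII cannot hold U+2062, so the serializer has to escape it.
        String ascii = serialize(parse(xml), "US-ASCII");
        System.out.println("US-ASCII output: " + ascii);

        // Parsing the escaped form gives back the same character.
        System.out.println("same text after re-parsing: " + textOf(ascii).equals(text));
    }
}
```

The DOM never records whether a character arrived as a reference or as
itself, so the choice between the two is made only when the document is
written out.
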

Chris

-----Original Message-----
From: Kahovec, Jakub [mailto:[EMAIL PROTECTED] 
Sent: Monday, February 28, 2005 15:21
To: [EMAIL PROTECTED]
Subject: RE: utf-8 characters problem

Xerces 2.6.2 produces it (LSParser, LSSerializer and XMLSerializer).
I've been using the Xerces parser and serializer in my Java authoring
tool to load and save documents. I noticed the encoding problem when I
loaded an XML document (with characters in character reference form) and
displayed it in the jeditpanel component. Instead of &#x002b; and
&#x2062; I saw '+' and a square-like character. I tried serializing the
XML document to the console as well as to a file, and loading the
document via InputStream or Reader input with LSInput, but I never got a
result in which the character sequences kept their original form.
Only when I explicitly set the encoding in LSInput to ISO-8859-1 and
loaded the document via InputStream did the sequence &#x2062; keep the
same form, but the sequence &#x002b; was changed to the '+' character
anyway.
Then I tried to inspect the structure of the DOM document in the
debugger (in Eclipse 3.1) but saw the same results ('+' and a square
character; probably it's only a problem of showing UTF-8 characters in
Eclipse). So to be honest I don't know how to find out where the problem
is: during parsing, serializing or displaying the data. I'm not very
experienced with encodings and charsets, but as far as I know Java
handles characters internally in UTF-16. Could that be part of the
problem? I don't really know.

Thanks for any ideas.

Jakub

-----Original Message-----
From: Bob Foster [mailto:[EMAIL PROTECTED]
Sent: Mon 2/28/2005 10:36 PM
To: [EMAIL PROTECTED]
Subject: Re: utf-8 characters problem

Exactly what Xerces or standard API is producing this result? Are you
sure you're not looking at the result in some editor (that is using the
wrong code page to represent your characters)?

XML parsers deliver characters in Unicode. You are apparently trying to
use the characters as though each character had eight bits.
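
A small sketch of what goes wrong when multi-byte data is read one byte
per character. The class name WrongCodePage and its helper methods are
made up for the example; ISO-8859-1 stands in for whatever code page the
viewer might be using.

```java
import java.nio.charset.StandardCharsets;

class WrongCodePage {
    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes the text as UTF-8, then decodes it as if every byte were one character. */
    static String misread(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String invisibleTimes = "\u2062";
        byte[] utf8 = invisibleTimes.getBytes(StandardCharsets.UTF_8);

        System.out.println("UTF-16 code units: " + invisibleTimes.length()); // 1
        System.out.println("UTF-8 bytes: " + hex(utf8));                     // E2 81 A2

        // One character turns into three when the bytes are read with the wrong code page.
        System.out.println("chars when misread: " + misread(invisibleTimes).length()); // 3

        // Decoding with the right charset gives the original character back.
        String reread = new String(utf8, StandardCharsets.UTF_8);
        System.out.println("decoded correctly: " + reread.equals(invisibleTimes)); // true
    }
}
```

U+2062 fits in a single UTF-16 code unit, so Java's internal string
representation holds it as one character; the three-byte form only
exists once the text is encoded as UTF-8.
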

Tell us a little more about what steps you took to see what you describe
and maybe someone will be able to help.

Bob Foster

Jakub Kahovec wrote:
> Hi,
> when I parse an XML document (with Xerces 2.6.2) whose XML declaration
> specifies UTF-8 encoding and which contains UTF-8 characters in
> character reference form (&#xxxx;), the parser replaces these
> references with ASCII characters. For some characters that is OK, but
> InvisibleTimes, for instance, changes into a strange, incorrect
> character sequence.
> I'd like to know whether it is possible to prevent character
> references from being replaced. Or is there a recommendation for how
> to handle these characters?
> 
> Here is a piece of my 'problematic' xml document
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <mathDoc>
> 
> <p>Factorise the following quadratic expression:
>        <math>
>          <mrow>
>            <msup>
>              <mrow>
>                <mi>x</mi>
>              </mrow>
>              <mrow>
>                <mn>2</mn>
>              </mrow>
>            </msup>
>            <mo>&#x002b;</mo> <!-- replaced with the character + -->
>            <mi>p</mi>
>            <mo>&#x2062;</mo> <!-- here is InvisibleTimes -->
>            <mi>x</mi>
>            <mo>&#x002b;</mo> <!-- replaced with the character + -->
>            <mi>q</mi>
>          </mrow>
>        </math>
> </p>
> 
> </mathDoc>
> 
> Thanks so much
> 
> Jakub

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
