RE: DOMString and international characters

Arnold, Curt Mon, 21 May 2001 15:26:24 -0700
>Debra Kelly wrote:
> >         Wegen_Überfüllung_geschlossen.doc
> > as the command line parm.

Hopefully I can explain what I think the problem is a little
more clearly, but you may have to dig a little deeper to know
exactly what is going on on your platform.

The Ü and ü in your example have code point values 0xDC and 0xFC in
ISO-8859-1 (aka Latin Alphabet No. 1).

When DOMString::operator=(const char*) initializes the
DOMString, it converts those byte values to the corresponding
Unicode code points.  However, to make that translation
it has to know (or rather, guess) which "code page" was used to
encode the characters, because the same byte value means
different things in different code pages.  For example, 0xDC is
Ü in ISO-8859-1 but the Cyrillic letter м in ISO-8859-5.
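
To make the code page point concrete, here is a minimal sketch that decodes the same byte under two code pages. The enum and function names are made up for illustration and have nothing to do with Xerces-C. ISO-8859-5 is used only because its mapping for this byte range is a simple offset, and only the range 0xB0..0xEF is handled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class CodePage { Latin1, Cyrillic5 };  // ISO-8859-1, ISO-8859-5

// Map one byte in the range 0xB0..0xEF to a Unicode code point.
// Bytes outside that range are not handled by this sketch.
std::uint32_t decodeByte(unsigned char b, CodePage cp) {
    assert(b >= 0xB0 && b <= 0xEF);
    if (cp == CodePage::Latin1) {
        return b;                      // ISO-8859-1 bytes map 1:1 to U+0000..U+00FF
    }
    return 0x0410u + (b - 0xB0u);      // ISO-8859-5: 0xB0..0xEF -> U+0410..U+044F
}
```

With this, decodeByte(0xDC, CodePage::Latin1) gives U+00DC (Ü), while decodeByte(0xDC, CodePage::Cyrillic5) gives U+043C (м): same byte, different character.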

What is most probably happening is that Xerces-C's guess
at the default code page for your platform is UTF-8, which is a
multi-byte encoding scheme.  Unicode code points less than 128
translate directly into a single byte; byte values between
128 and 255, however, are always part of a multi-byte sequence.
0xDC 0x62 (displayed as Üb in ISO-8859-1) and 0xFC 0x6C (ül)
are not legal sequences in UTF-8, because neither 0x62 nor 0x6C
is a continuation byte (0x80 to 0xBF).
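
The UTF-8 rule described above can be checked with a small structural validator. This is a simplified sketch, not the check Xerces-C or iconv actually performs; the function name is invented, and it does not reject every overlong or surrogate encoding.

```cpp
#include <cstddef>
#include <string>

// Returns true if every lead byte in `s` is followed by the right number
// of continuation bytes (bit pattern 10xxxxxx, i.e. 0x80..0xBF).
bool looksLikeUtf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (lead < 0x80) {
            extra = 0;                       // plain ASCII, single byte
        } else if (lead >= 0xC2 && lead <= 0xDF) {
            extra = 1;                       // two-byte sequence
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            extra = 2;                       // three-byte sequence
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            extra = 3;                       // four-byte sequence
        } else {
            return false;                    // stray continuation or illegal lead byte
        }
        if (s.size() - i - 1 < extra) {
            return false;                    // sequence is cut off at end of input
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return false;                // expected a continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}
```

Fed the ISO-8859-1 bytes 0xDC 0x62, the function returns false because 0x62 is not a continuation byte. The UTF-8 encoding of Ü followed by b (0xC3 0x9C 0x62) passes.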

To work around the issue, you need to "hint" to the transcoder
that you want to use ISO-8859-1 as your default code page.
I don't know how to do that on your platform; you might check the
documentation for iconv (which is probably what is being used).
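
If no way of setting the default code page turns up, one possible alternative is to skip the guess and widen the ISO-8859-1 bytes by hand, since ISO-8859-1 byte values coincide with Unicode code points U+0000..U+00FF. The sketch below is only that: latin1ToUtf16 is a made-up helper, not a Xerces-C function, and it assumes the result can be handed to a library entry point that accepts UTF-16 text, which would need checking for your version.

```cpp
#include <string>

// Hypothetical helper (not part of Xerces-C): widen ISO-8859-1 bytes
// to UTF-16 code units, one code unit per input byte.
std::u16string latin1ToUtf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (char ch : bytes) {
        // Go through unsigned char so 0xDC does not sign-extend to 0xFFDC.
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(ch)));
    }
    return out;
}
```

This only works when the input really is ISO-8859-1; for any other code page a proper transcoder is still needed.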
