"When one of my DOMString is initialized from an XMLCh*,
the XMLCh* is analyzed to determine the appropriate internal
representation for that particular DOMString.
If the XMLCh* only contains code points <= 255, then the internal
representation is marked as ISO-8859-1 (or USASCII if all
code points are <= 127). If it contains code points > 255,
then it will choose UTF-8 or UTF-16 depending on relative sizes.
There are a lot of nasty switch statements within the DOMString
class that direct you to the appropriate implementation of
DOMString::operator+() for example, depending on the internal
representations of the participating strings. However, the
ISO-8859-1 implementations are more efficient since they can
directly convert character offsets into byte offsets that
would not be possible with UTF-8."
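The selection logic described in that quote might look roughly like this. This is a hypothetical sketch, not the actual Xerces-C DOMString code; the names `Repr` and `pickRepr` are invented for illustration, and `char16_t` stands in for `XMLCh` (a 16-bit unit in Xerces-C):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tags for the internal representations described above.
enum class Repr { USASCII, ISO8859_1, UTF };

// Scan a null-terminated 16-bit string and pick the narrowest
// representation that can hold every code point in it.
Repr pickRepr(const char16_t* s) {
    char16_t maxCp = 0;
    for (; *s; ++s)
        if (*s > maxCp) maxCp = *s;
    if (maxCp <= 0x7F)  return Repr::USASCII;
    if (maxCp <= 0xFF)  return Repr::ISO8859_1;
    return Repr::UTF;   // real code would weigh UTF-8 vs UTF-16 by size
}
```

So `pickRepr(u"hello")` would choose USASCII, while a string containing, say, the euro sign (U+20AC) would fall through to the UTF case.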
But the point was that, just because it can store any single byte value,
that doesn't mean it can retain the semantics of those code points. If the
original encoding had code point 135 meaning "paragraph separator", and all
you remember is that the value is 135, how will you convert that to
something else later? What you'll get when you transcode back out again is
whatever 135 means in 8859-1, which could be (I'm too lazy to look it
up) 'o' with umlaut or something.
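A concrete way to see the semantic-loss argument: the same byte value decodes to different code points depending on which single-byte encoding you assume it came from. The functions below are illustrative only (the tables are abbreviated to the one byte of interest); as it happens, byte 0x87 is a C1 control character in ISO-8859-1 but the double dagger (U+2021) in Windows-1252:

```cpp
#include <cassert>
#include <cstdint>

// ISO-8859-1 maps every byte directly to the code point of the same
// value, so 0x87 becomes U+0087, a C1 control character.
char32_t decodeLatin1(uint8_t b) {
    return static_cast<char32_t>(b);
}

// In Windows-1252, byte 0x87 instead means the double dagger, U+2021.
// (Identity elsewhere is a simplification for this sketch.)
char32_t decodeWindows1252(uint8_t b) {
    if (b == 0x87) return U'\u2021';
    return static_cast<char32_t>(b);
}
```

If all you remember is "the byte was 135", you can no longer tell which of those two code points the original document meant.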
You will basically have to use UTF-8 for everything, since that's the only
way you can retain the semantics of the code points being stored, and still
get reasonable compression for most single-byte encodings.
Maybe I'm just missing something, but if I understand you, it isn't going to
work with 8859-1.
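The reason UTF-8 retains code point semantics is that the code point itself, not a re-interpreted byte, is what the encoding carries, so it survives a round trip. A minimal sketch (covering code points up to U+FFFF, which is enough to illustrate; the names are invented for this example, and no error handling is shown):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one code point (up to U+FFFF) as a UTF-8 byte sequence.
std::vector<uint8_t> utf8Encode(char32_t cp) {
    std::vector<uint8_t> out;
    if (cp <= 0x7F) {
        out.push_back(static_cast<uint8_t>(cp));            // 1 byte
    } else if (cp <= 0x7FF) {
        out.push_back(static_cast<uint8_t>(0xC0 | (cp >> 6)));          // 2 bytes
        out.push_back(static_cast<uint8_t>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<uint8_t>(0xE0 | (cp >> 12)));         // 3 bytes
        out.push_back(static_cast<uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// Decode one UTF-8 sequence back to the original code point.
char32_t utf8Decode(const std::vector<uint8_t>& b) {
    if (b[0] < 0x80) return b[0];
    if (b[0] < 0xE0) return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F);
    return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F);
}
```

ASCII still costs one byte per character here, which is the "reasonable compression" point above; and any code point, including one like U+2021, comes back out as itself rather than as whatever its low byte happens to mean in ISO-8859-1.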
--------------
Dean Roddey
Software Geek Extraordinaire
Portal, Inc
[EMAIL PROTECTED]