Actually, XMLCh cannot always be a 16-bit unsigned type. It needs to float to
wchar_t, which is what is required to allow it to go straight to the wide
char APIs on the local platform. This might mean that it is 32 bits in size.
This has not been done so far, due to a failure on my part to explain what
I intended, but it should hopefully be fixed in the next non-revision
release.
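
A rough sketch of what such a floating typedef could look like (the macro
name and platform check are hypothetical, not the actual headers):

    // Illustrative only: let XMLCh track the platform's wide char type where
    // that makes sense, and fall back to a fixed 16-bit unsigned value elsewhere.
    #if defined(PLATFORM_WCHAR_IS_UNICODE)     // hypothetical config macro
        typedef wchar_t        XMLCh;          // 16 or 32 bits, per platform
    #else
        typedef unsigned short XMLCh;          // fixed 16-bit code units
    #endif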

For some folks, a DOM using 32-bit chars might not be acceptable. If so,
the DOM might have to internalize characters in a different form from
XMLCh. XMLCh is the representation for the core of the parser system and
should float to wchar_t. If the DOM needs to store characters in a
different size, that's its business really, but it needs to use its own
floatable char typedef, not XMLCh. But for many people this will still be a
big burden, because in a large number of cases they will still want to pass
the raw buffer to local wide character APIs, so they are stuck transcoding
everything out of the DOM before they can use it in the local wide
character APIs, which will then probably transcode it again.
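
(For illustration, a separate DOM-side typedef might look something like
this; DOMCh is just a hypothetical name:)

    // Hypothetical sketch: the DOM keeps its own character typedef so it can
    // choose a storage size independently of the parser core's XMLCh.
    typedef unsigned short DOMCh;   // DOM-internal storage unit, not XMLCh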

As for using standard library services, this is a bit sticky.
Unfortunately, there are a lot of big customers out there using compilers
that won't support namespaces and probably have no modern standard library
available. Flexibility might dictate staying with our own DOM string
representation, and it certainly argues that everything that comes out of
the core should be raw XMLCh arrays.

Even on those platforms that do have standard library services, that might
not be the form people want the data in. So we would pay the price of using
objects, force the inclusion of standard library headers that people might
not want or have, and many people would still just pull out the raw buffer
anyway. So, to me, the shortest aggregate distance to all desirable
destinations is to provide data in the XMLCh form from the core of the
system. The DOM has different needs, of course, and its string
representation is debatable. But the issues of bringing in standard library
dependencies still argue against using them for the DOM string
representation.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]



[EMAIL PROTECTED] on 12/20/99 12:03:08 PM

Please respond to [EMAIL PROTECTED]

To:   [EMAIL PROTECTED]
cc:
Subject:  RE: PROPOSAL: DOMString


As I see it, there are two reasons why you might need to transcode.

First, you might need to access some particular algorithm: you have some
nice tokenizer class, or a regular expression class, that takes char*.
Instead of rewriting the class to take XMLCh*, you transcode, process, and
perhaps convert the result back.

The second scenario is when you are transcoding for output, to display a
string to the user or write it to a file, and need to access a platform
function that assumes ASCII or some other encoding.

As I see it, the C++ standard library deals with the first issue well.  The
char_traits classes and the templated std::basic_string class make it
possible to deal with strings abstractly.  Searching, sorting, etc. work
the same whether your XMLCh is an 8-bit signed char or a 64-bit unsigned
long.  Writing good, char-size-independent algorithms is possible and
simple.
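
A minimal sketch of that idea, in modern C++ terms (char16_t standing in
for a 16-bit XMLCh; the names are illustrative):

    // Sketch: the same search code works regardless of the character width,
    // because basic_string abstracts over the unit type via char_traits.
    #include <string>

    typedef char16_t XMLCh;                        // stand-in 16-bit unit
    typedef std::basic_string<XMLCh> XMLChString;

    bool contains(const XMLChString& hay, const XMLChString& needle)
    {
        // Nothing here cares whether XMLCh is 8, 16, or 32 bits wide.
        return hay.find(needle) != XMLChString::npos;
    }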

The second issue is more complex.  When it comes time to deal with the
issues of encodings, etc. you just have to bite the bullet and do it.
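
As a sketch of what that looks like for output, assuming the
XMLString::transcode helper in a current Xerces-C release (it converts to
the local code page; the exact cleanup call varies by version):

    // Sketch, assuming Xerces-C's XMLString::transcode, which copies an
    // XMLCh string into the local code page so narrow-char APIs can use it.
    #include <cstdio>
    #include <xercesc/util/XMLString.hpp>

    void printToConsole(const XMLCh* const value)
    {
        char* local = xercesc::XMLString::transcode(value); // local code page copy
        std::printf("%s\n", local);                          // narrow-char platform API
        xercesc::XMLString::release(&local);                 // free the copy
    }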

So, while an algorithm may be designed to be independent of a particular
character representation, a program can't escape it for I/O.  My proposal
was to replace DOMString with basic_string<XMLCh>, with a possibly
conditional definition of XMLCh.  But I'd be happy if we just used
std::basic_string<XMLCh> where XMLCh was always a 16-bit unsigned integer,
like it is today.  This would allow the use of generic string algorithms,
in the style of the standard library.
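
A rough sketch of that proposal (the conditional and macro name are
illustrative; a real definition would also need a suitable char_traits
specialization for whatever XMLCh ends up being):

    // Illustrative sketch of the proposal: DOMString becomes a basic_string
    // over XMLCh, with XMLCh possibly defined conditionally per platform.
    #include <string>

    #if defined(USE_PLATFORM_WCHAR)        // hypothetical configuration macro
        typedef wchar_t        XMLCh;
    #else
        typedef unsigned short XMLCh;      // 16-bit unsigned, as today
    #endif

    // basic_string over a non-built-in character type relies on a
    // std::char_traits specialization being available for it.
    typedef std::basic_string<XMLCh> DOMString;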

