I don't believe that it's reasonable to strictly say that we will put out a
16-bit character. That's following a spec that wasn't designed for this
language, to the point that it's counterproductive. As long as the code
points stored are Unicode code points, the size of the character is best
set to that of the native wide character, wchar_t. I'm sure that we will
end up going this way, since it far and away makes the most sense and
greases the most usage paths the greatest amount. It's not at all
acceptable to force people whose wide character APIs don't take a 16-bit
character to transcode just to add the leading bytes.
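To make the choice concrete, here is a minimal sketch of the two candidate
definitions (the type names are hypothetical, not taken from the actual
Xerces headers):

    // Strictly spec-driven: a fixed 16-bit code unit, as the DOM's UTF-16
    // DOMString mandates. On platforms whose wide character isn't 16 bits,
    // every call into a wchar_t-based API then needs a transcode first.
    typedef unsigned short XMLCh_Fixed16;    // hypothetical name

    // Platform-friendly: the native wide character, so parser output can be
    // handed straight to wide-character system APIs with no conversion.
    typedef wchar_t        XMLCh_Native;     // hypothetical name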

The DOM cannot use a raw string. The DOM has many requirements for
substringing and reference counting, which wouldn't be very practical with
a raw string, and I'm not sure that the standard library strings (even if
there were no other practical encumbrances to using them) would
sufficiently meet those needs. The DOM string will give you a pointer to
the raw XMLCh buffer, which lets everyone get at it in its most fundamental
form and convert it to the desired representation with as little overhead
as possible (in aggregate).
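As an illustration of why a raw buffer alone isn't enough for the DOM, here
is a rough sketch (names and details are invented, not the actual Xerces
class) of a string that reference counts its storage, substrings cheaply,
and still exposes the raw XMLCh buffer:

    #include <cstddef>
    #include <memory>
    #include <vector>

    typedef wchar_t XMLCh;   // assumption for this sketch

    class DOMStringSketch     // hypothetical, not the real DOMString
    {
    public:
        DOMStringSketch(const XMLCh* src, std::size_t len)
            : fBuffer(std::make_shared< std::vector<XMLCh> >(src, src + len))
            , fStart(0)
            , fLength(len)
        {
        }

        // Cheap substring: shares the same reference-counted buffer.
        DOMStringSketch substringData(std::size_t offset, std::size_t count) const
        {
            DOMStringSketch sub(*this);
            sub.fStart  += offset;
            sub.fLength  = count;
            return sub;
        }

        // The escape hatch: direct access to the XMLCh data, so callers can
        // build whatever representation they need with one pass over it.
        const XMLCh* rawBuffer() const { return fBuffer->data() + fStart; }
        std::size_t  length()    const { return fLength; }

    private:
        std::shared_ptr< std::vector<XMLCh> > fBuffer;  // shared storage
        std::size_t fStart;
        std::size_t fLength;
    };

A plain XMLCh* can't carry the sharing or the substring bookkeeping, which
is the point being made above; exposing the buffer alongside them gets you
both.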

The use of XMLCh as the only output from the core is exactly to increase
performance and allow everyone an equal opportunity to put the text into
whatever they want. If we pick something like the standard library string,
that will be more likely to decrease performance and make more people jump
through more hoops when that is not the format they need (and the overhead
of using a more complex representation will be doubly wasteful).
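For example (assuming XMLCh is mapped to wchar_t, and with invented helper
names), the "equal opportunity" path looks like this:

    #include <cstddef>
    #include <string>

    typedef wchar_t XMLCh;   // assumption for this sketch

    // A client that wants a standard wide string wraps the buffer once.
    std::wstring asWString(const XMLCh* p, std::size_t len)
    {
        return std::wstring(p, len);   // one copy, no transcoding
    }

    // A client that needs some other form (UTF-8, a GUI toolkit string, ...)
    // converts once from the raw buffer. If the core instead returned a
    // heavyweight string class, that client would pay for constructing the
    // class *and* for its own conversion -- the doubly wasteful case.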

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]



[EMAIL PROTECTED] on 12/20/99 01:36:55 PM

Please respond to [EMAIL PROTECTED]

To:   [EMAIL PROTECTED]
cc:
Subject:  RE: PROPOSAL: DOMString




Of course, you do what you need to do for the parser's internal
representation.  But at the API level, where people deal with DOM and SAX,
we have more constraints.  The W3C DOM recommendation mandates that
DOMStrings are strings in the UTF-16 encoding.  So, strictly speaking, the
parser's DOM and SAX classes should use either XMLCh*, where XMLCh is a
16-bit unsigned integer, or some string class type that allows access to an
underlying XMLCh*.  You could argue (and I'd tend to agree) that it would
be desirable to have the APIs give a string in wchar_t that could be
immediately consumed by the platform.  Unfortunately, the recommendation
doesn't say that.

As for the standard library, I'll say this.  No string class is going to be
perfect for everyone.  But std::basic_string<> is standard, is better
documented, is more efficient, and is a lot less quirky than DOMString.
www.stlport.org has an open source implementation of the standard library
classes, ported to a variety of platforms, including AIX and OS/390.  It
can also be compiled in a mode that doesn't require namespace support.
This should make all those cold-war-era compilers very happy <g>.
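For instance, a thin typedef is all it takes to put the standard string to
work.  This sketch assumes XMLCh is wchar_t; a 16-bit XMLCh would need a
char_traits for that type on some library implementations (or
char16_t/std::u16string in modern C++):

    #include <string>

    typedef wchar_t XMLCh;                        // assumption for this sketch
    typedef std::basic_string<XMLCh> XMLChString; // == std::wstring here

    // Clients then get find, substr, comparison, iterators, etc. from the
    // standard library instead of from a parser-specific string class.
    XMLChString localName(const XMLChString& qname)
    {
        const XMLChString::size_type colon = qname.find(XMLCh(':'));
        return (colon == XMLChString::npos) ? qname : qname.substr(colon + 1);
    }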

Exposing everything as XMLCh* is an interesting idea.  It looks like the
Xerces SAX interface went that route.  But if it is just to avoid having
the parser, the API and the client's code agree on a representation, then
performance is going to suffer.

What would things look like if we used wchar_t* everywhere: internal
implementation, APIs, and client code (like Xalan/C++)?  We wouldn't be
strictly following the XML DOM recommendation, but the impedance mismatch
would be greatly reduced.  A lot of unnecessary copying of strings would be
eliminated.  And we wouldn't have one string representation for DOM and
another for SAX, like we do now.
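Roughly, the SAX side of that scheme would look something like this (a
sketch with an invented handler name, not the actual Xerces interface):

    #include <cstddef>
    #include <cwchar>

    // With parser, API, and client all agreed on wchar_t, the text handed to
    // a callback can go straight to a wide-character platform call, with no
    // copy and no transcode.
    class ContentHandlerSketch      // hypothetical
    {
    public:
        virtual ~ContentHandlerSketch() {}

        virtual void characters(const wchar_t* chars, std::size_t length)
        {
            std::wprintf(L"%.*ls", static_cast<int>(length), chars);
        }
    };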

-Rob




