Of course, you do what you need to do for the parser's internal
representation.  But at the API level, where people deal with DOM and SAX,
we have more constraints.  The W3C DOM recommendation mandates that
DOMStrings are strings in the UTF-16 encoding.  So, strictly speaking, the
parser's DOM and SAX classes should either use XMLCh*, where XMLCh is a
16-bit unsigned integer, or some string class type that allows access to an
underlying XMLCh*.  You could argue (and I'd tend to agree) that it
would be desirable to have the APIs give a string in wchar_t that could be
immediately consumed by the platform.  Unfortunately, the recommendation
doesn't say that.
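To make the constraint concrete, here is a minimal sketch of what the recommendation forces on us.  The XMLCh typedef and the sample string below are hypothetical stand-ins, not the actual Xerces declarations:

```cpp
#include <cstdint>

// Hypothetical stand-in for the parser's XMLCh: DOMString is mandated
// to be UTF-16, so the code unit must be a 16-bit unsigned integer,
// regardless of what the platform's wchar_t happens to be (2 bytes on
// some systems, 4 on others -- hence the mismatch discussed above).
typedef std::uint16_t XMLCh;

// "XML" as UTF-16 code units, NUL-terminated.
const XMLCh kName[] = { 0x0058, 0x004D, 0x004C, 0x0000 };
```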

As for the standard library, I'll say this.  No string class is going to be
perfect for everyone.  But std::basic_string<> is standard, is better
documented, is more efficient, and is a lot less quirky than DOMString.
www.stlport.org has an open source implementation of the standard library
classes, ported to a variety of platforms, including AIX and OS/390.  It
can also be compiled in a mode that doesn't require namespace support.
This should make all those cold-war-era compilers very happy <g>.
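For what it's worth, the two ideas combine naturally: instantiate std::basic_string<> on the 16-bit code unit.  A sketch, using modern C++'s char16_t as the code unit (the names XMLChString and makeTag are mine, purely for illustration):

```cpp
#include <string>

// Spelling DOMString's contents with the standard string template.
// Modern C++ names the 16-bit code unit char16_t; the email-era XMLCh
// would be a typedef for whatever 16-bit type the compiler offered.
typedef char16_t XMLCh;
typedef std::basic_string<XMLCh> XMLChString;   // i.e. std::u16string

// Ordinary basic_string operations work: concatenation, compare, etc.
inline XMLChString makeTag(const XMLChString& name) {
    return XMLChString(u"<") + name + u">";
}
```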

Exposing everything as XMLCh* is an interesting idea.  It looks like the
Xerces SAX interface went that route.  But if it is done just to avoid
making the parser, the API, and the client's code agree on a
representation, then performance is going to suffer.
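The cost shows up as a conversion copy at every API crossing.  A sketch of what a client ends up writing when the parser hands out XMLCh* but the platform wants wchar_t (the XMLCh typedef and toPlatform helper are hypothetical, and the loop assumes BMP-only text, i.e. no surrogate pairs):

```cpp
#include <string>
#include <cstdint>

typedef std::uint16_t XMLCh;   // hypothetical parser code unit

// Every string crossing an XMLCh* API has to be copied and widened,
// one code unit at a time, before platform routines can consume it.
inline std::wstring toPlatform(const XMLCh* p) {
    std::wstring out;
    for (; *p != 0; ++p)
        out.push_back(static_cast<wchar_t>(*p));
    return out;
}
```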

What would things look like if we used wchar_t* everywhere: internal
implementation, APIs, and client code (like Xalan/C++)?  We wouldn't be
strictly following the XML DOM recommendation, but the impedance mismatch
would be greatly reduced.  A lot of unnecessary copying around of strings
would be eliminated.  And we wouldn't have one string representation for
DOM and another for SAX, like we do now.

-Rob

