Of course, you do what you need to do for the parser's internal representation. But at the API level, where people deal with DOM and SAX, we have more constraints. The W3C DOM recommendation mandates that DOMStrings are strings in the UTF-16 encoding. So, strictly speaking, the parser's DOM and SAX classes should either use XMLCh*, where XMLCh is a 16-bit unsigned integer, or some string class type that allows access to an underlying XMLCh*. You could argue (and I'd tend to agree) that it would be desirable to have the APIs give a string in wchar_t that could be immediately consumed by the platform. Unfortunately, the recommendation doesn't say that.
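Just to make that concrete, here's a rough sketch of what "XMLCh* where XMLCh is a 16-bit unsigned integer" means in practice. The typedef and the xmlStrLen helper are my own illustration, not the actual Xerces headers:

```cpp
#include <cstdint>
#include <cstddef>

// Sketch only: a 16-bit unsigned code unit, as the DOM
// recommendation's UTF-16 DOMString implies.
typedef std::uint16_t XMLCh;

// A DOM/SAX-style API would then traffic in null-terminated
// XMLCh* buffers, e.g. a strlen analogue (hypothetical helper):
std::size_t xmlStrLen(const XMLCh* s) {
    std::size_t n = 0;
    while (s && s[n]) ++n;
    return n;
}
```

The point is that every client then has to adapt this raw UTF-16 buffer to whatever its platform's native wide string is.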
As for the standard library, I'll say this: no string class is going to be perfect for everyone. But std::basic_string<> is standard, is better documented, is more efficient, and is a lot less quirky than DOMString. www.stlport.org has an open-source implementation of the standard library classes, ported to a variety of platforms, including AIX and OS/390. It can also be compiled in a mode that doesn't require namespace support. That should make all those cold-war-era compilers very happy <g>.

Exposing everything as XMLCh* is an interesting idea, and it looks like the Xerces SAX interface went that route. But if it's just a way to avoid having the parser, the API, and the client's code agree on a representation, then performance is going to suffer.

What would things look like if we used wchar_t* everywhere: the internal implementation, the APIs, and client code (like Xalan/C++)? We wouldn't be strictly following the XML DOM recommendation, but the impedance mismatch would be greatly reduced. A lot of unnecessary copying of strings would be eliminated. And we wouldn't have one string representation for DOM and another for SAX, like we do now.

-Rob
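P.S. For what it's worth, here's a rough sketch (an illustration only, not a proposed Xerces API) of what a single std::basic_string-based UTF-16 representation shared by the parser, DOM, SAX, and clients could look like. char16_t stands in for a 16-bit XMLCh, and makeTag is an invented example function:

```cpp
#include <string>

// One 16-bit code-unit type and one string type, used on both
// sides of the API boundary (hypothetical names).
typedef char16_t XMLCh;
typedef std::basic_string<XMLCh> XmlString;

// Because the parser and the client agree on the representation,
// strings cross the boundary without conversion or copying:
XmlString makeTag(const XmlString& name) {
    return XmlString(u"<") + name + XmlString(u">");
}
```

That's the impedance-matching argument in one line: no DOMString-to-something-else copy at every call.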