Regarding the new DOM, "Curt Arnold" <[EMAIL PROTECTED]> wrote:

> 2.  Align DOM_String with STL's basic_string<XMLCh> semantics
>
> Since Andy said that his DOM will break existing app's, might as well take
> advantage of the opportunity to change DOMString too.  Basically, this says
> that, at least, any method exposed in the public interface to DOMString
> should have the same signature and work identically to a method in the
> basic_string template.  For performance reasons, the DOMString
> implementation could be different, though I would love to find a way to
> typedef DOMString as basic_string<XMLCh> while still maintaining the
> optimizations of the current DOMString internally.  That might involve using
> a distinct internal string class within the DOM implementation.

I believe that plain vanilla (XMLCh *) pointers to null-terminated strings
are the best way to go.  They are simple, fast, as small as you can get
with a utf-16 based format, and mesh cleanly with the parser, which also
uses this format.  Using basic_string<XMLCh> as the string type within the
DOM internal structure itself would introduce overhead at a point where
every byte and machine cycle counts.

basic_string<type> defines constructors and many methods that take
pointers to null terminated strings of <type>.  So, although I'm no expert
with STL strings, it would seem that there would be reasonable
interoperability between XMLCh * and basic_string<XMLCh>, certainly much
better than with the current DOMString.  And some sort of a read-only
basic_string could be done that just referenced the const XMLCh * pointers
returned everywhere by the new DOM, which would avoid a lot of copying of
data and extra storage allocations.
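
As a rough illustration of that interoperability, here is a minimal sketch.
It assumes XMLCh is a 16-bit utf-16 code unit (char16_t stands in for it),
and nodeName(), copyOf() and viewOf() are hypothetical names, not part of
any DOM API.  The read-only reference is expressed with C++17's
basic_string_view, which is one possible way to get a non-owning string; it
is not something the DOM provides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Assumption: XMLCh is a 16-bit utf-16 code unit; char16_t stands in for it.
using XMLCh = char16_t;

// Hypothetical stand-in for a DOM accessor that returns a pointer to a
// null-terminated string owned by the document.
inline const XMLCh* nodeName() {
    static const XMLCh name[] = u"chapter";
    return name;
}

// Owning copy: basic_string has a constructor that takes a pointer to a
// null-terminated string, so this allocates and copies the characters.
inline std::basic_string<XMLCh> copyOf(const XMLCh* s) {
    return std::basic_string<XMLCh>(s);
}

// Read-only, non-owning view: refers to the DOM's own storage, so nothing
// is copied.  It is only valid while the DOM keeps the string alive.
inline std::basic_string_view<XMLCh> viewOf(const XMLCh* s) {
    return std::basic_string_view<XMLCh>(s);
}
```

The view costs one pass over the string to find its length and no
allocation; the copy is what an application would take only when it needs
the text to outlive the node.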

Within the DOM, utf-8 vs utf-16 strings is another interesting question.
For data that consists mostly of Latin characters, utf-8 strings are about
half the size.  And for the DOM, with big documents being 100% memory
resident, that would be a significant saving.  The W3C DOM API does call
for utf-16, but that could probably be hand-waved away - all of the
necessary information is present, so the character access and substring
functions in the DOM API could still work correctly (in terms of utf-16
units, as specified by W3C).  If we were
starting everything from scratch, both the parser and the DOM, I would
lobby for utf-8 everywhere.  But I don't think that the DOM should deviate
from what the scanner is doing, which is to say, I think that we should
stick with utf-16 strings.
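
To put rough numbers on the size argument, here is a small sketch that
counts the storage each encoding would need for the same text.  It again
assumes XMLCh is a 16-bit utf-16 code unit (char16_t here); utf16Units()
and utf8Bytes() are hypothetical helpers, not parser or DOM functions.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: XMLCh is a 16-bit utf-16 code unit; char16_t stands in for it.
using XMLCh = char16_t;

// Number of utf-16 code units before the terminating null.
// Storage in utf-16 is twice this many bytes.
inline std::size_t utf16Units(const XMLCh* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Bytes needed to hold the same text as utf-8 (terminator not counted).
// An unpaired surrogate is counted as 3 bytes, like any other unit in
// the 0x0800..0xFFFF range.
inline std::size_t utf8Bytes(const XMLCh* s) {
    std::size_t bytes = 0;
    for (std::size_t i = 0; s[i] != 0; ++i) {
        const XMLCh c = s[i];
        if (c < 0x80) {
            bytes += 1;                       // ASCII
        } else if (c < 0x800) {
            bytes += 2;                       // e.g. accented Latin letters
        } else if (c >= 0xD800 && c <= 0xDBFF &&
                   s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            bytes += 4;                       // surrogate pair, one code point
            ++i;                              // skip the low surrogate
        } else {
            bytes += 3;                       // rest of the BMP
        }
    }
    return bytes;
}
```

For plain ASCII such as u"hello" this gives 5 bytes of utf-8 against 10
bytes of utf-16, which is the "about half" case.  Accented Latin letters
take 2 bytes either way, and most CJK text takes 3 bytes in utf-8 against 2
in utf-16, so the saving depends on the data.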


Andy Heninger
IBM, Cupertino, CA
[EMAIL PROTECTED]