I disagree.  By having separate interfaces for strings for DOM and SAX, we
run into all sorts of performance hits.  For example, in Xalan, we have a
DOM that we want to serialize.  We have classes like FormatterToHTML and
FormatterToXML, all derived from SAX's DocumentHandler.  Methods in the
DocumentHandler interface, like startElement and endElement, are declared to
take null-terminated XMLCh*.  But if I try to access the DOM's rawBuffer(),
I'm not guaranteed a null-terminated string.  To get one, I must make a
temporary copy of the string.  Sure, you may have saved some time doing
substring someplace, but at what cost?  I'm going to need to create a lot
of temporary strings, leading to increased memory fragmentation, etc.  The
same thing happens if I want to use the util/TextOutputStream, since it
expects null-terminated XMLCh*'s.
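
To make that cost concrete, here is a minimal sketch of the copy a caller
ends up writing.  It is not Xerces or Xalan code: XMLCh is a stand-in
typedef, and RawView, countUntilNull and makeTerminatedCopy are names made
up for the illustration.  The only assumption taken from the discussion is
that the raw buffer comes with a length and no guaranteed terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for Xerces' XMLCh; the real typedef is platform dependent.
typedef char16_t XMLCh;

// Hypothetical view of a raw buffer: a pointer plus a length,
// with no terminator guaranteed after the last character.
struct RawView {
    const XMLCh* data;
    std::size_t  length;
};

// Hypothetical consumer in the style of a SAX handler method:
// it accepts only a null-terminated string and scans for the terminator.
std::size_t countUntilNull(const XMLCh* name) {
    std::size_t n = 0;
    while (name[n] != 0) {
        ++n;
    }
    return n;
}

// The temporary the caller is forced to build: one heap allocation and
// one full copy per string, just to append a terminator.
std::vector<XMLCh> makeTerminatedCopy(const RawView& view) {
    assert(view.data != nullptr || view.length == 0);
    std::vector<XMLCh> tmp(view.data, view.data + view.length);
    tmp.push_back(0);
    return tmp;
}
```

Every element or attribute name passed to a handler would go through
something like makeTerminatedCopy(), so the allocation count scales with
the size of the document being serialized.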

Another DOMString "optimization" is demonstrated by this code:

DOMString x(100);

for (int i=0; i<100; i++)
    x.appendData(... calc some character ...);

Since DOMString doesn't have an appendData() overload for XMLCh, this will
create a temporary DOMString for each character we're appending.  This is
bad enough.  But then I see that appendData(DOMString) allocates only as
much memory as it needs, so looping like this would cause numerous
reallocations unless we pre-allocate the size of the destination DOMString,
as shown here.

But what does this code really do?  The first time through the loop,
appendData() sees that we're appending to an empty DOMString, so it just
swaps the internal implementations, throws away the pre-allocated string,
and then goes on to grow the DOMString by one character every time
through the loop.  I'm sure this saved someone some time in some
benchmark, but not in my code.
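
The behavior described above can be modelled with a toy class.  This is
not the DOMString source: ToyString, appendChar and allocationsForLoop are
invented for the sketch, and the class only mimics the three properties
described in this message (a temporary per character, adopting the source
buffer when the destination is empty, and growing to exactly the size
needed).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

// Toy model only; it imitates the behavior described in the text.
class ToyString {
public:
    static int allocations;  // counts every buffer allocation

    explicit ToyString(std::size_t reserve)
        : buf_(alloc(reserve)), len_(0), cap_(reserve) {}
    explicit ToyString(char c) : buf_(alloc(1)), len_(1), cap_(1) { buf_[0] = c; }
    ~ToyString() { delete[] buf_; }
    ToyString(const ToyString&) = delete;
    ToyString& operator=(const ToyString&) = delete;

    // No single-character overload in the model: build a temporary first.
    void appendChar(char c) {
        ToyString tmp(c);
        appendData(tmp);
    }

    void appendData(ToyString& other) {
        if (other.len_ == 0) return;
        if (len_ == 0) {
            // Empty destination: adopt the source's buffer.  Any reserved
            // buffer ends up in `other` and is freed with it.
            std::swap(buf_, other.buf_);
            std::swap(len_, other.len_);
            std::swap(cap_, other.cap_);
            return;
        }
        std::size_t need = len_ + other.len_;
        if (need > cap_) {
            char* grown = alloc(need);  // exactly enough, no slack
            std::memcpy(grown, buf_, len_);
            delete[] buf_;
            buf_ = grown;
            cap_ = need;
        }
        std::memcpy(buf_ + len_, other.buf_, other.len_);
        len_ = need;
    }

    std::size_t length() const { return len_; }

private:
    static char* alloc(std::size_t n) {
        if (n == 0) return nullptr;
        ++allocations;
        return new char[n];
    }
    char* buf_;
    std::size_t len_;
    std::size_t cap_;
};

int ToyString::allocations = 0;

// Runs the loop from the text and reports how many buffers were allocated.
int allocationsForLoop(std::size_t reserve, int count) {
    ToyString::allocations = 0;
    ToyString x(reserve);
    for (int i = 0; i < count; ++i) {
        x.appendChar('a');
    }
    return ToyString::allocations;
}
```

In this model, allocationsForLoop(100, 100) reports 200 allocations and
allocationsForLoop(0, 100) reports 199, so the pre-allocation buys nothing
and costs one extra buffer.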

The advantage of a standard string class is that it has been validated
across a wide variety of uses and has reasonable performance everywhere.

As it is now, we have some APIs taking DOMString, some taking
null-terminated XMLCh*, and some taking non-terminated XMLCh* with an
additional length param.  This can't be good.  What if I want to input from
SAX, build a DOM tree and then output via TextOutputStream?  Watch the
data:  I'll start with XMLCh*, copy it into DOMStrings, and then make
another copy to get null-terminated XMLCh*'s for output.  Even without
transcoding, this is inefficient.  That Xalan/C++ spends about 40% of its
non-I/O time in the DOMString ctor is further evidence.


-Rob

Dean Roddey wrote:

>The DOM cannot use a raw string. The DOM has many requirements for
>substringing and reference counting, which wouldn't be very practical with
>a raw string, and I'm not sure that the standard library strings (even if
>there were not other practical encumbrances to using them) would
>sufficiently meet those needs. The DOM string will give you a pointer to
>the raw XMLCh buffer, which lets everyone get to it in the most
>fundamental form so that everyone can get it to the desired
>representation with as little overhead as possible (on aggregate.)

