RE: DOMString (was: xalan-c Problem with Xerces initialization)

roddey 19 Jan 2000 20:29:56 -0000

The SAX API basically just passes along const pointers to fixed buffers
inside the parser, and it reuses them over and over for every new callout
event. Much speed is gained from doing this. If the parser had to allocate
new buffers every time it parsed a new piece of markup, you would have just
moved the burden from one place to the other without reducing it, while
introducing extra overhead to do the reference counting. And that is how it
would play out: you would have 'adopted' every buffer passed out, in order
to fill up the DOM nodes, so the 'copy on write' would be triggered every
time the scanner wrote to the string objects to fill them with new data for
the next piece of markup parsed.
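
To make the buffer reuse concrete, here is a minimal, self-contained sketch.
It is not Xerces code: ToyScanner, Handler, CollectingHandler and collect()
are hypothetical names, and XMLCh is assumed to be char16_t purely for
illustration. It only shows the general pattern: the callout hands over a
const pointer into a buffer that is overwritten on the next event, so a
handler that wants to keep the data has to copy it during the callout.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: a stand-in for XMLCh; the real type is platform dependent.
using XMLCh = char16_t;

// Hypothetical handler interface, loosely modeled on a SAX characters() callout.
struct Handler {
    virtual ~Handler() = default;
    virtual void characters(const XMLCh* chars, std::size_t length) = 0;
};

// Toy scanner that reuses one internal buffer for every callout.
class ToyScanner {
public:
    void emit(const std::u16string& text, Handler& handler) {
        // Overwrite the previous contents; the vector keeps its capacity.
        buffer_.assign(text.begin(), text.end());
        handler.characters(buffer_.data(), buffer_.size());
    }
private:
    std::vector<XMLCh> buffer_;
};

// The pointer is only valid during the callout, so anything kept is copied.
struct CollectingHandler : Handler {
    std::vector<std::u16string> kept;
    void characters(const XMLCh* chars, std::size_t length) override {
        kept.emplace_back(chars, length);
    }
};

// Runs every piece through the scanner and returns the handler's copies.
std::vector<std::u16string> collect(const std::vector<std::u16string>& pieces) {
    ToyScanner scanner;
    CollectingHandler handler;
    for (const auto& piece : pieces) scanner.emit(piece, handler);
    return handler.kept;
}

// Usage sketch: all pieces survive although the scanner reused its buffer.
inline void demoCollect() {
    auto kept = collect({u"first", u"second", u"third"});
    assert(kept.size() == 3 && kept[0] == u"first");
}
```

A handler that only inspects the data and moves on pays for no copy at all,
which is the common case being protected here.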

So the main problem with this scenario is that you are kind of borrowing
from Peter to pay Paul, plus the interest you are going to owe Peter. And
it would cause the 98% common case usage of SAX to get slower (due to the
need to do reference counting on all buffers internally), for no real
benefit to the users of it.

Maybe I'm missing something in this, but that's how it looks to me. I just
don't think it would really be a win.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]



[EMAIL PROTECTED] on 01/18/2000 09:11:08 PM

Please respond to [EMAIL PROTECTED]

To:   [EMAIL PROTECTED]
cc:
Subject:  RE: DOMString (was: xalan-c Problem with Xerces initialization)



Here's the flow for Xalan:

We have 2 inputs, an XML document, on which we use DOMParser, and an XSL
stylesheet, on which we use the SAXParser.  In our SAX DocumentHandler, we
build our stylesheet into a special "subclassed" DOM.  We then apply the
templates in the stylesheet DOM, matching patterns in the document DOM.
When we need to realize output, then we send result elements through SAX
again, since we have DocumentHandler subclasses that do XML, HTML and Text
formatting.

When creating the stylesheet's DOM, and when output formatting via SAX, we
need to deal with literal wide-character strings, for things like character
entities and standard XSLT element names.  With DOM, we can use the macro
we've been discussing, to avoid transcoding except under platforms like
HP-UX, where transcoding via DOMStrings would be used.  But then we need a
solution for pushing string literals through SAX.  It would be really nice
if it was the same solution.  We could use arrays of Unicode characters,
like Xerces does, though that looks painful.
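
For readers who have not seen the "arrays of Unicode characters" style, the
following is a rough sketch of what it looks like. It assumes XMLCh is a
16-bit code unit (char16_t here, which is an assumption, not the Xerces
definition), and kTemplate, xmlLength and xmlEquals are hypothetical names.
The point is that the literal is spelled out one code unit at a time, so it
does not depend on how the compiler represents L"foo".

```cpp
#include <cassert>
#include <cstddef>

// Assumption: a stand-in for XMLCh; the real type is platform dependent.
using XMLCh = char16_t;

// "template" spelled out as code units, with a terminating zero.
static const XMLCh kTemplate[] = {
    0x74, 0x65, 0x6D, 0x70, 0x6C, 0x61, 0x74, 0x65, 0
};

// Counts code units up to (not including) the terminating zero.
std::size_t xmlLength(const XMLCh* s) {
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// Compares two zero-terminated XMLCh strings for equality.
bool xmlEquals(const XMLCh* a, const XMLCh* b) {
    while (*a != 0 && *a == *b) { ++a; ++b; }
    return *a == *b;
}

// Usage sketch: the array can be compared against parsed names directly.
inline void demoLiteral() {
    assert(xmlLength(kTemplate) == 8);
}
```

This is portable but, as noted, tedious to write and to read, which is why a
macro or a shared string class is attractive.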

You can also probably see our performance bottleneck.  Every time we cross
the interface from SAX to DOM, we need to create a DOMString from XMLCh*
and this means that we need to copy the data.  If we had a single string
class, reference-counted with copy-on-write semantics, used by both the
DOM and SAX APIs, then we could imagine something very nice: many strings
could be copied into a buffer once, at parse time, and stay in that buffer,
as the string's handle was passed from SAX to DOM to SAX and finally
written out.  Most uses of XSLT involve restructuring the data, rearranging
it, taking content and wrapping it in style tags.  So, many or even most of
the strings would benefit if we could avoid this impedance mismatch.
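
As a rough illustration of the idea, here is a minimal sketch of a
reference-counted string handle with copy-on-write. SharedString is a
hypothetical class, not DOMString or any Xerces/Xalan type, and it leans on
std::shared_ptr for the counting. Copying a handle shares the buffer;
writing through a handle that is shared detaches it onto a fresh buffer
first.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write string handle (single-threaded sketch).
class SharedString {
public:
    explicit SharedString(std::u16string text)
        : data_(std::make_shared<std::u16string>(std::move(text))) {}

    const std::u16string& str() const { return *data_; }

    // True when both handles refer to the same underlying buffer.
    bool sharesBufferWith(const SharedString& other) const {
        return data_ == other.data_;
    }

    // Write access: detach onto a new buffer if anyone else holds this one.
    void assign(const std::u16string& text) {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::u16string>(text);
        } else {
            *data_ = text;
        }
    }

private:
    std::shared_ptr<std::u16string> data_;
};

// Usage sketch: passing the handle along copies no character data.
inline void demoShare() {
    SharedString parsed(u"chapter");
    SharedString inDom = parsed;           // handle copy, buffer shared
    assert(parsed.sharesBufferWith(inDom));
    parsed.assign(u"section");             // writer detaches, reader unaffected
    assert(inDom.str() == u"chapter");
}
```

Note that in this sketch a producer that keeps writing into a handle it has
already handed out pays for a new buffer on every write.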

-Rob


Dean Roddey wrote:


>>I guess the meta-question is this: do we want a solution that encompasses
>>SAX and XMLCh*, or just DOM and DOMString? It seems that the differences
>>in the representation of L"foo" are going to affect SAX as well. And Xalan
>>uses both.


>I'm not sure that there is a problem here. SAX is just a layer on the
>parser that just spits stuff out. It's always spit out in XMLCh format, and
>was always that way since it was internalized down in the guts of the
>parser.


>Or, are you saying that you guys are also spitting stuff out a SAX
>interface?