RE: Xalan 2.2.0D6 has great performance gain!

Scott_Boag Sat, 07 Jul 2001 12:00:52 -0700

What Joe said.  I can also ramble on a bit on the theory, intent, and
experience so far...

The intent of the DTM is:

1) Reduce garbage collection and heap churn for large trees.
2) Hide the details of implementation of both storage and iteration in the
tree implementation.  (While the DOM does this to an extent, it requires
the use of Java String objects, and requires node identity as a Java
object.)
3) Enable a sort of "pull" API for XML trees.
4) Have an API that fits the XPath data model on first principle.
5) Have a tree model that both XSLTC and the interpreted Xalan can use.

By no means is this meant to be an API in competition with the DOM, like
JDOM.  It is a significantly different structure, aimed at a different
space than DOM or JDOM.

The current DTM is composed of the base interfaces, and the "reference"
implementations (SAX2DTM and DOM2DTM).  Eventually we would like to move
these into xml-commons, along with those org.apache.xml.util classes that
it is dependent on.

Important features in the DTM:

1) Extended Type IDs, which should eventually provide much faster type
comparison.  The intent behind these is that they match use for XSLTC.
2) Simple indexing of element types.  This helps mostly with '//' patterns.
This is really a feature of the implementation, and not the API.
3) Built-in iterators and "stateless" Traversers.
4) Incremental build of the tree from both SAX events and DOM trees (using
the parseSome() feature if Xerces is the parser).
5) Multiple tree management.  (However, we have a fair amount of work to do
with this...)
6) Export to DOM APIs.
7) Use of an XMLString object (along with a caller-defined XMLStringFactory)
to reduce string creation.
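
To illustrate features 1 and 2 above, here is a minimal Java sketch of how
an integer type ID per expanded name and a per-type index of nodes could
work.  It is not the Xalan code: ExpandedTypeTable, ElementIndex, and their
methods are hypothetical names, and the sketch uses boxed java.util
collections for brevity where a real DTM would presumably use int arrays to
avoid object churn.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical: assigns one small int per (namespace URI, local name) pair,
// so that later name tests are int comparisons instead of string compares.
final class ExpandedTypeTable {
    private final Map<String, Integer> ids = new HashMap<>();

    int idFor(String namespaceUri, String localName) {
        String key = "{" + namespaceUri + "}" + localName;
        Integer id = ids.get(key);
        if (id == null) {
            id = ids.size();          // next unused ID
            ids.put(key, id);
        }
        return id;
    }
}

// Hypothetical: remembers, per type ID, the node indices of that type in the
// order they were added (document order, if nodes are added while building).
final class ElementIndex {
    private final Map<Integer, List<Integer>> nodesByType = new HashMap<>();

    void add(int typeId, int nodeIndex) {
        nodesByType.computeIfAbsent(typeId, k -> new ArrayList<>()).add(nodeIndex);
    }

    List<Integer> nodesOfType(int typeId) {
        return nodesByType.getOrDefault(typeId, Collections.emptyList());
    }
}
```

With an index like this, a pattern such as '//foo' can read the node list
for foo's type ID directly instead of walking the whole tree.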

Note that the actual size of a tree using the current SAX2DTM is not that
much smaller than the older Stree implementation.  However, we are working
on a more compact implementation, though it may be at some expense to
execution speed.

The downsides of DTM:
1) Our implementation requires what is essentially internal heap
management.  For instance, de-allocation of subtrees if you know you are
done with them, which we have long wanted to do, becomes much more
complicated.  Also, because growing the arrays is quite an issue, we use
the SuballocatedIntVector object instead of a straight int array.
2) Limits of tree size. Currently a DTM tree is limited to 1,048,575 nodes,
and the number of trees you can have per manager is 4095.  This is because
the node handle carries both node identity and tree identity.
3) Simple access of values per node is slightly more expensive: to go to a
next sibling, the tree identity has to be masked off, an array access (via
SuballocatedIntVector ) has to take place, and the tree identity has to be
added back on the resulting value.  Compare this with simply getting a
nextSibling field from the node in Stree.
4) The API has a certain complexity to it.  I am less worried about callers
than I am about implementors.  For instance, John Gentilin has been working
on a JDBC implementation of the DTM for the SQL extension, and I think has
found the task more than trivial.  We might make this better over time.
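
To make downsides 1 through 3 concrete, here is a minimal Java sketch of a
node handle that packs tree identity and node identity into one int, plus a
chunked int vector standing in for SuballocatedIntVector.  The bit layout
(12 bits of tree ID above 20 bits of node index) is an assumption inferred
from the 4095 and 1,048,575 limits; the class names, the chunk size, and the
-1 null marker are likewise hypothetical and not taken from the Xalan source.

```java
import java.util.Arrays;

// Hypothetical stand-in for SuballocatedIntVector: grows by adding fixed-size
// chunks, so existing data is never copied into a larger array.
final class ChunkedIntVector {
    private static final int CHUNK_BITS = 10;              // 1024 ints per chunk (arbitrary)
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;
    private int[][] chunks = new int[16][];

    void set(int index, int value) {
        int c = index >>> CHUNK_BITS;
        if (c >= chunks.length) {
            // Only the table of chunk references is copied, not the chunks.
            chunks = Arrays.copyOf(chunks, Math.max(c + 1, chunks.length * 2));
        }
        if (chunks[c] == null) {
            chunks[c] = new int[CHUNK_SIZE];
        }
        chunks[c][index & CHUNK_MASK] = value;
    }

    int get(int index) {
        int c = index >>> CHUNK_BITS;
        if (c >= chunks.length || chunks[c] == null) {
            return 0;                                       // unset slots read as 0
        }
        return chunks[c][index & CHUNK_MASK];
    }
}

final class HandleSketch {
    static final int NODE_BITS = 20;                        // assumed layout
    static final int NODE_MASK = (1 << NODE_BITS) - 1;      // 1,048,575
    static final int MAX_TREE_ID = (1 << (32 - NODE_BITS)) - 1;  // 4095
    static final int NULL = -1;                             // assumed "no node" marker

    static int makeHandle(int treeId, int nodeIndex) {
        return (treeId << NODE_BITS) | nodeIndex;
    }

    static int treeOf(int handle) { return handle >>> NODE_BITS; }

    static int nodeOf(int handle) { return handle & NODE_MASK; }

    // Mask off the tree identity, do the array access, add the identity back.
    static int nextSibling(ChunkedIntVector nextSib, int handle) {
        int next = nextSib.get(nodeOf(handle));
        return next == NULL ? NULL : makeHandle(treeOf(handle), next);
    }
}
```

The three steps in nextSibling() are the extra cost compared with reading a
nextSibling field directly from a node object, as in Stree.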

So, you can see, it is a bit of a mixed bag.  I would like to say it is the
perfect solution, but it isn't.  On the other hand, it seems to be so far
an overall win.  My biggest worry right now is what we're going to do about
subtree deallocation (and reuse) when the time comes.

It is important to note that this is still a work in progress.  We're very
open to further evolution of the design, and new brainstorms.  Also, it is
our hope that we'll eventually have multiple implementations of the DTM.
For instance, we would like a Xerces native DTM, that can take advantage
of Xerces native data structures.

[I want to make sure I give proper credit to the XSLTC DOMImpl.  The
current DTM is really a combination of our experience with the XalanJ1 DTM,
and many ideas gleaned from the XSLTC DOMImpl.]

-scott


"Li Liang" <lliang@firstrain.com>
07/06/2001 05:49 PM
Please respond to xalan-dev

To:      <[EMAIL PROTECTED]>
cc:      (bcc: Scott Boag/CAM/Lotus)
Subject: RE: Xalan 2.2.0D6 has great performance gain!

Hi, Scott,

I am really interested in the implementation of DTM. Can you please tell
me where I can find more documentation about it? The one on Apache's
website is pretty plain. Pointers to messages on this list, or at least
to the class I should start with, would also help.

Many thanks.

Li

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Friday, July 06, 2001 12:14 PM
To: [EMAIL PROTECTED]
Subject: Re: Xalan 2.2.0D6 has great performance gain!



> Is this
> because of the usage of DTM?

In some ways it is unfortunate that there are several changes from 2.1.0 to
2.2.0D6, so it's a little hard to tell exactly where the improvement is.
There is some element indexing in the DTM that might account for this.
Also, the iterators were rewritten to be far simpler and less buggy (though
less incremental) for what I call criss-cross patterns like "//foo/bar"
(i.e. patterns that don't return document order naturally).  Possibly a
combination of these.

> Is this because some default behavior of Xalan is changed? I saw
> some post about the namespace stuff. Any insight would be appreciated.

Is it possible for you to write a reproducible test case?  This may or may
not be a bug, and we need to get to the bottom of it as soon as possible.

-scott


"Li Liang" <lliang@firstrain.com>
07/06/2001 10:26 AM
Please respond to xalan-dev

To:      <[EMAIL PROTECTED]>
cc:      (bcc: Scott Boag/CAM/Lotus)
Subject: Xalan 2.2.0D6 has great performance gain!

Hi,

I've tried the 2.2D6 with my xml/xsl pair, here's the result:

2.1        22 Seconds
2.2        4 Seconds

The XML is 13k, and the XSL has some "//foo/bar" patterns. This is a HUGE
gain. Is this because of the usage of DTM? If it is, I would really like to
see some benchmark numbers for it. Where can I find more details about this
implementation?


Just one strange thing: I have a custom XSLTAdaptor, which traverses a DOM
tree and fires SAX events to Xalan. It worked perfectly with 2.0 and 2.1,
but 2.2 outputs nothing except the declaration "<? xml version=1.0
?>". Is this because some default behavior of Xalan is changed? I saw
some post about the namespace stuff. Any insight would be appreciated.

Keep up the good work!

Li