On Saturday, 01/11/2003 at 11:16 CET, Christian Haul
<[EMAIL PROTECTED]> wrote:
> On 11.Jan.2003 -- 04:48 PM, Joseph Kesselman wrote:
> > Current status: We are *not* currently applying pruning to source
> > documents. Logic to analyse a stylesheet, find opportunities for
> > pruning, and apply those insights is possible but nontrivial, and we
> > simply haven't had time to pursue it.
> Thank you for your fast answer. Actually, I'd like to work on that, but
> before I do I'd like to check existing efforts. I'm most interested in
> the first point. So, are there any pointers you could provide that don't
> show up when searching the list for "pruning"?
There hasn't been a lot of discussion beyond the "it would be a good idea
if we can do it" level.
> Which would be the best CVS branch to work with? HEAD?
I've been out of the office for a month, but my understanding is that we
currently have two to three branches active -- main line development
(HEAD), XSLT 2.0 prep/development ("xslt20"), and -- if I remember
correctly -- the xsltc folks had also forked off a branch for some of
their own early development, though that may have been folded back in by
now. (At some point, of course, all the changes have to be reconciled.)
The xslt20 branch includes some significant changes to DTM in order to
carry additional type information, since 2.0 will be schema-type
sensitive, as well as some other generalizations to the XSLT data model
(for example, much of the distinction between nodesets and temporary trees
goes away in 2.0). So if you're planning on significantly changing DTM
behaviors, you might want to start with that branch. If you want to start
by working on the stylesheet analysis code and just invoke the existing
tail-pruning logic (or try a filtering approach, or wait until you have
some analysis results before you start trying to implement the actual
filter/prune operations) you ought to be able to start with HEAD with
reasonable confidence that we'll be able to merge your work when the time
comes.
Brief brain-dump of pruning-related issues:
Most current implementations of DTM (those based on DTMDefaultBase) were
explicitly designed to be a write-once-read-many data model, constructed
in document order (though possibly incrementally) and never altered
thereafter, and some of our traversal code made simplifying assumptions
based on the resulting sequential numbering.
Tail-pruning (removing the nodes most recently added) wasn't too hard to
fit to that model, since very few of the nodes actually have to be altered
-- basically, only the parent and previous sibling immediately preceding
the pruned section have to be updated, and the storage occupied by the
removed nodes can easily be reused by overwriting them as further nodes
arrive.
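
To make that concrete, here's a rough sketch of the tail-prune
bookkeeping. The layout and field names below are invented for
illustration (a flat node table with parallel parent/first-child/
next-sibling arrays); they are not the actual DTMDefaultBase fields.

// Minimal sketch of tail-pruning in a write-once, document-order node table.
// The field names (m_parent, m_firstChild, m_nextSibling, m_size) are
// hypothetical stand-ins, not the real Xalan internals.
class TailPruneSketch {
    static final int NULL = -1;

    int[] m_parent      = new int[1024]; // parent index per node
    int[] m_firstChild  = new int[1024]; // first-child index per node
    int[] m_nextSibling = new int[1024]; // next-sibling index per node
    int   m_size        = 0;             // next free slot; nodes 0..m_size-1 are live

    /** Remove the subtree rooted at 'root', which must be the most recently
     *  added subtree (i.e. it occupies slots root..m_size-1). */
    void tailPrune(int root) {
        int parent = m_parent[root];

        // Patch the parent: if 'root' was its first child, the parent now has
        // none; otherwise find the sibling immediately preceding 'root' and
        // terminate the sibling chain there.
        if (parent != NULL) {
            if (m_firstChild[parent] == root) {
                m_firstChild[parent] = NULL;
            } else {
                int prev = m_firstChild[parent];
                while (m_nextSibling[prev] != root) prev = m_nextSibling[prev];
                m_nextSibling[prev] = NULL;
            }
        }

        // Reclaim storage simply by moving the write cursor back; the slots
        // will be overwritten as further nodes arrive in document order.
        m_size = root;
    }
}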
Pruning an arbitrary tree out of the center of a DTM, on the other hand,
is much more difficult. Removing a subtree by patching its parent and
siblings is conceptually straightforward... but breaks some of the
simplifying assumptions I mentioned in the traversal code. You can no
longer assume that an element is immediately followed by its attributes
(if any), namespace nodes (if any), and children; the next un-pruned child
may be considerably farther on in the tables. The traversers and iterators
would have to be checked to make sure they could handle this.
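
Continuing that toy layout, splicing a subtree out of the middle looks
something like the sketch below; the point to notice is that the
spliced-out rows stay where they are, so traversal has to follow the
explicit links rather than relying on index arithmetic.

// Continues the hypothetical layout from the tail-pruning sketch above;
// the names remain invented for illustration.
class MidTreePruneSketch extends TailPruneSketch {

    /** Unlink the subtree rooted at 'root' from anywhere in the table,
     *  without moving or reclaiming any rows. */
    void unlinkSubtree(int root) {
        int parent = m_parent[root];
        int next   = m_nextSibling[root];

        if (parent != NULL) {
            if (m_firstChild[parent] == root) {
                // 'root' was the first child; the parent now starts at the
                // next un-pruned sibling.
                m_firstChild[parent] = next;
            } else {
                // Splice 'root' out of its parent's sibling chain.
                int prev = m_firstChild[parent];
                while (m_nextSibling[prev] != root) prev = m_nextSibling[prev];
                m_nextSibling[prev] = next;
            }
        }
        // The pruned rows are now unreachable but still occupy storage, and a
        // traverser that assumed "children start at root+1" would walk into them.
    }
}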
Also: Since one goal of pruning is not just to simplify the model (though
that could improve search time) but to actually reduce storage overhead,
we'd need to figure out how to excise some or all of the removed subtree.
Normally, to remove rows from a table you copy the later rows down into
the slots previously occupied by the removed rows... but that could be
hugely expensive. My proposal was that we take advantage of the fact that
DTMDefaultBase uses suballocated ("chunked") arrays, and excise only as
much as could be removed as complete chunks -- so only the high-level
chunk table would need to be updated, which would be far less expensive.
The disadvantage would be that we'd have to compute which chunks fell
entirely within the excised subtree -- and portions of that subtree which
spilled over into adjacent chunks would be left over as waste space,
though they might be recovered if/when a higher-level subtree was pruned
later.
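
The "whole chunks only" test is just arithmetic. Here's a hedged sketch,
assuming a hypothetical power-of-two chunk size rather than whatever
constants DTMDefaultBase actually uses.

// Sketch of deciding which suballocated chunks can be dropped outright.
// CHUNK_BITS is an assumed value for illustration only.
final class ChunkExciseSketch {
    static final int CHUNK_BITS = 10;              // 1024 nodes per chunk (assumed)
    static final int CHUNK_SIZE = 1 << CHUNK_BITS;

    /** Index of the first chunk lying entirely inside [firstPruned, lastPruned],
     *  or -1 if none does. */
    static int firstWholeChunk(int firstPruned, int lastPruned) {
        // Round the start of the pruned range up to a chunk boundary...
        int chunk = (firstPruned + CHUNK_SIZE - 1) >> CHUNK_BITS;
        // ...and keep it only if that chunk also ends before the range does.
        return ((chunk + 1) << CHUNK_BITS) - 1 <= lastPruned ? chunk : -1;
    }

    /** Index of the last chunk lying entirely inside the range, or -1. */
    static int lastWholeChunk(int firstPruned, int lastPruned) {
        int chunk = ((lastPruned + 1) >> CHUNK_BITS) - 1;
        return chunk >= 0 && (chunk << CHUNK_BITS) >= firstPruned ? chunk : -1;
    }
    // When both results are non-negative, chunks firstWholeChunk..lastWholeChunk
    // can be dropped from the top-level chunk table; the partial chunks at
    // either end remain as waste until a larger enclosing subtree is pruned.
}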
Also: Compaction would involve changing the DTM internal index numbers,
and external node handles, of nodes following the prune. I believe most
local references within DTM are now stored as relative to the current
node, so a subtree could be moved down without breaking its internal
connections. But any references from outside the DTM tree -- stored maps
from IDs to nodes, or iterators currently in progress, or variables (!?!)
-- would have to be found and updated (which suggests the prune should be
done at a time when few of those references exist, if possible -- and,
obviously, when nothing is looking inside the subtree which will be pruned
away!). I hadn't even started sketching that part of the code, back when
I was analysing this proposal.
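
As a toy illustration of what that fix-up would mean for a single stored
handle, assuming invented names and a compaction that removes one
contiguous run of slots:

// Sketch of fixing up an external node handle after a compaction that removed
// 'removedCount' slots starting at 'firstRemoved'. Internal references stored
// relative to the current node would survive the move unchanged; absolute
// handles held outside the DTM (ID maps, live iterators, variables) would not,
// and each of them would need a pass like this.
final class HandleFixupSketch {
    static final int NULL = -1;

    static int remapHandle(int handle, int firstRemoved, int removedCount) {
        if (handle < firstRemoved) return handle;               // unaffected
        if (handle < firstRemoved + removedCount) return NULL;  // pointed into the pruned subtree
        return handle - removedCount;                           // shifted down by the compaction
    }
}

For instance, with firstRemoved=1000 and removedCount=600, a stored handle
of 5000 would come back as 4400, while anything pointing into the removed
range would come back null -- which is exactly the "nothing should be
looking inside the pruned subtree" constraint above.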
It may be that DTM, as it stands, isn't the optimal data model for a
pruning XSLT processor. This is one of several reasons we're considering
changing the DTM API to access the model through cursor/iterator objects
instead of node handles; that'd give us more freedom to play with
alternative implementations without paying an overhead for translating
between numeric references and whatever the native form is. (It should
also save us some conversion overhead, which ought to be a good thing even
for the existing DTM implementations... and it might be a bit cleaner
architecturally.)
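
Purely for illustration, a cursor-style surface might look something like
the sketch below; these names are invented for this example and are not a
proposed or actual API.

// Illustrative cursor-style access sketch: callers navigate and read through
// the cursor, so the implementation is free to change how nodes are stored.
interface NodeCursor {
    boolean toFirstChild();      // move to the first child, if any
    boolean toNextSibling();     // move to the next sibling, if any
    boolean toParent();          // move back up to the parent, if any
    String  getNodeName();       // name of the node under the cursor
    String  getStringValue();    // string value of the node under the cursor
    NodeCursor cloneCursor();    // independent copy, so callers can remember positions
}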
______________________________________
Joe Kesselman / IBM Research