On Saturday, 01/11/2003 at 11:16 CET, Christian Haul
<[EMAIL PROTECTED]> wrote:
> On 11.Jan.2003 -- 04:48 PM, Joseph Kesselman wrote:
> > Current status: We are *not* currently applying pruning to source
> > documents. Logic to analyse a stylesheet, find opportunities for
> > pruning, and apply those insights is possible but nontrivial, and we
> > simply haven't had time to pursue it.
> Thank you for your fast answer. Actually, I'd like to work on that, but
> before I do I'd like to check existing efforts. I'm most interested in
> the first point. So, are there any pointers you could provide that don't
> show up when searching the list for "pruning"?
There hasn't been a lot of discussion beyond the "it would be a good idea
if we can do it" level.
> Which would be the best CVS branch to work with? HEAD?
I've been out of the office for a month, but my understanding is that we
currently have two to three branches active -- main line development
(HEAD), XSLT 2.0 prep/development ("xslt20"), and -- if I remember
correctly -- the xsltc folks had also forked off a branch for some of
their own early development, though that may have been folded back in by
now. (At some point, of course, all the changes have to be reconciled.)
The xslt20 branch includes some significant changes to DTM in order to
carry additional type information, since 2.0 will be schema-type
sensitive, as well as some other generalizations to the XSLT data model
(for example, much of the distinction between nodesets and temporary trees
goes away in 2.0). So if you're planning on significantly changing DTM
behaviors, you might want to start with that branch. If you want to start
by working on the stylesheet analysis code and just invoke the existing
tail-pruning logic (or try a filtering approach, or wait until you have
some analysis results before you start trying to implement the actual
filter/prune operations) you ought to be able to start with HEAD with
reasonable confidence that we'll be able to merge your work when the time
comes.
Brief brain-dump of pruning-related issues:
Most current implementations of DTM (those based on DTMDefaultBase) were
explicitly designed to be a write-once-read-many data model, constructed
in document order (though possibly incrementally) and never altered
thereafter, and some of our traversal code made simplifying assumptions
based on the resulting sequential numbering.
Tail-pruning (removing the nodes most recently added) wasn't too hard to
fit to that model, since very few of the nodes actually have to be altered
-- basically, only the parent and previous sibling immediately preceding
the pruned section have to be updated, and the storage occupied by the
removed nodes can easily be reused by overwriting them as further nodes
arrive.
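
To make that concrete, here's a rough sketch of the tail-prune
bookkeeping. The layout and field names below are invented for
illustration (a flat node table with parallel parent/first-child/
next-sibling arrays); they are not the actual DTMDefaultBase fields.

// Minimal sketch of tail-pruning in a write-once, document-order node table.
// The field names (m_parent, m_firstChild, m_nextSibling, m_size) are
// hypothetical stand-ins, not the real Xalan internals.
class TailPruneSketch {
    static final int NULL = -1;

    int[] m_parent      = new int[1024]; // parent index per node
    int[] m_firstChild  = new int[1024]; // first-child index per node
    int[] m_nextSibling = new int[1024]; // next-sibling index per node
    int   m_size        = 0;             // next free slot; nodes 0..m_size-1 are live

    /** Remove the subtree rooted at 'root', which must be the most recently
     *  added subtree (i.e. it occupies slots root..m_size-1). */
    void tailPrune(int root) {
        int parent = m_parent[root];

        // Patch the parent: if 'root' was its first child, the parent now has
        // none; otherwise find the sibling immediately preceding 'root' and
        // terminate the sibling chain there.
        if (parent != NULL) {
            if (m_firstChild[parent] == root) {
                m_firstChild[parent] = NULL;
            } else {
                int prev = m_firstChild[parent];
                while (m_nextSibling[prev] != root) prev = m_nextSibling[prev];
                m_nextSibling[prev] = NULL;
            }
        }

        // Reclaim storage simply by moving the write cursor back; the slots
        // will be overwritten as further nodes arrive in document order.
        m_size = root;
    }
}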
Pruning an arbitrary tree out of the center of a DTM, on the other hand,
is much more difficult. Removing a subtree by patching its parent and
siblings is conceptually straightforward... but breaks some of the
simplifying assumptions I mentioned in the traversal code. You can no
longer assume that an element is immediately followed by its attributes
(if any), namespace nodes (if any), and children; the next un-pruned child
may be considerably farther on in the tables. The traversers and iterators
would have to be checked to make sure they could handle this.
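
Continuing that toy layout, splicing a subtree out of the middle looks
something like the sketch below; the point to notice is that the
spliced-out rows stay where they are, so traversal has to follow the
explicit links rather than relying on index arithmetic.

// Continues the hypothetical layout from the tail-pruning sketch above;
// the names remain invented for illustration.
class MidTreePruneSketch extends TailPruneSketch {

    /** Unlink the subtree rooted at 'root' from anywhere in the table,
     *  without moving or reclaiming any rows. */
    void unlinkSubtree(int root) {
        int parent = m_parent[root];
        int next   = m_nextSibling[root];

        if (parent != NULL) {
            if (m_firstChild[parent] == root) {
                // 'root' was the first child; the parent now starts at the
                // next un-pruned sibling.
                m_firstChild[parent] = next;
            } else {
                // Splice 'root' out of its parent's sibling chain.
                int prev = m_firstChild[parent];
                while (m_nextSibling[prev] != root) prev = m_nextSibling[prev];
                m_nextSibling[prev] = next;
            }
        }
        // The pruned rows are now unreachable but still occupy storage, and a
        // traverser that assumed "children start at root+1" would walk into them.
    }
}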
Also: Since one goal of pruning is not just to simplify the model (though
that could improve search time) but to actually reduce storage overhead,
we'd need to figure out how to excise some or all of the removed subtree.
Normally, to remove rows from a table you copy the later rows down into
the slots previously occupied by the removed rows... but that could be
hugely expensive. My proposal was that we take advantage of the fact that
DTMDefaultBase uses suballocated ("chunked") arrays, and excise only as
much as could be removed as complete chunks -- so only the high-level
chunk table would need to be updated, which would be far less expensive.
The disadvantage would be that we'd have to compute which chunks fell
entirely within the excised subtree -- and portions of that subtree which
spilled over into adjacent chunks would be left over as waste space,
though they might be recovered if/when a higher-level subtree was pruned
later.
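
The "whole chunks only" test is just arithmetic. Here's a hedged sketch,
assuming a hypothetical power-of-two chunk size rather than whatever
constants DTMDefaultBase actually uses.

// Sketch of deciding which suballocated chunks can be dropped outright.
// CHUNK_BITS is an assumed value for illustration only.
final class ChunkExciseSketch {
    static final int CHUNK_BITS = 10;              // 1024 nodes per chunk (assumed)
    static final int CHUNK_SIZE = 1 << CHUNK_BITS;

    /** Index of the first chunk lying entirely inside [firstPruned, lastPruned],
     *  or -1 if none does. */
    static int firstWholeChunk(int firstPruned, int lastPruned) {
        // Round the start of the pruned range up to a chunk boundary...
        int chunk = (firstPruned + CHUNK_SIZE - 1) >> CHUNK_BITS;
        // ...and keep it only if that chunk also ends before the range does.
        return ((chunk + 1) << CHUNK_BITS) - 1 <= lastPruned ? chunk : -1;
    }

    /** Index of the last chunk lying entirely inside the range, or -1. */
    static int lastWholeChunk(int firstPruned, int lastPruned) {
        int chunk = ((lastPruned + 1) >> CHUNK_BITS) - 1;
        return chunk >= 0 && (chunk << CHUNK_BITS) >= firstPruned ? chunk : -1;
    }
    // When both results are non-negative, chunks firstWholeChunk..lastWholeChunk
    // can be dropped from the top-level chunk table; the partial chunks at
    // either end remain as waste until a larger enclosing subtree is pruned.
}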
Also: Compaction would involve changing the DTM internal index numbers,
and external node handles, of nodes following the prune. I believe most
local references within DTM are now stored as relative to the current
node, so a subtree could be moved down without breaking its internal
connections. But any references from outside the DTM tree -- stored maps
from IDs to nodes, or iterators currently in progress, or variables (!?!)
-- would have to be found and updated (which suggests the prune should be
done at a time when few of those references exist, if possible -- and,
obviously, when nothing is looking inside the subtree which will be pruned
away!). I hadn't even started sketching that part of the code, back when
I was analysing this proposal.
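
As a toy illustration of what that fix-up would mean for a single stored
handle, assuming invented names and a compaction that removes one
contiguous run of slots:

// Sketch of fixing up an external node handle after a compaction that removed
// 'removedCount' slots starting at 'firstRemoved'. Internal references stored
// relative to the current node would survive the move unchanged; absolute
// handles held outside the DTM (ID maps, live iterators, variables) would not,
// and each of them would need a pass like this.
final class HandleFixupSketch {
    static final int NULL = -1;

    static int remapHandle(int handle, int firstRemoved, int removedCount) {
        if (handle < firstRemoved) return handle;               // unaffected
        if (handle < firstRemoved + removedCount) return NULL;  // pointed into the pruned subtree
        return handle - removedCount;                           // shifted down by the compaction
    }
}

For instance, with firstRemoved=1000 and removedCount=600, a stored handle
of 5000 would come back as 4400, while anything pointing into the removed
range would come back null -- which is exactly the "nothing should be
looking inside the pruned subtree" constraint above.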
It may be that DTM, as it stands, isn't the optimal data model for a
pruning XSLT processor. This is one of several reasons we're considering
changing the DTM API to access the model through cursor/iterator objects
instead of node handles; that'd give us more freedom to play with
alternative implementations without paying an overhead for translating
between numeric references and whatever the native form is. (It should
also save us some conversion overhead, which ought to be a good thing even
for the existing DTM implementations... and it might be a bit cleaner
architecturally.)
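
Purely for illustration, a cursor-style surface might look something like
the sketch below; these names are invented for this example and are not a
proposed or actual API.

// Illustrative cursor-style access sketch: callers navigate and read through
// the cursor, so the implementation is free to change how nodes are stored.
interface NodeCursor {
    boolean toFirstChild();      // move to the first child, if any
    boolean toNextSibling();     // move to the next sibling, if any
    boolean toParent();          // move back up to the parent, if any
    String  getNodeName();       // name of the node under the cursor
    String  getStringValue();    // string value of the node under the cursor
    NodeCursor cloneCursor();    // independent copy, so callers can remember positions
}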
______________________________________
Joe Kesselman / IBM Research