Re: [xml] redicting parts of trees

Daniel Veillard Sat, 14 May 2005 01:57:28 -0700

On Fri, May 13, 2005 at 11:15:08PM +0200, Martijn Faassen wrote:
> Hi there,
> 
> libxml2 documents use dictionaries to hash commonly used strings in XML 
> documents, for reasons of space and time efficiency.
> 
> Unfortunately this makes it impossible to safely move a node from one 
> tree to another.
> 
> In the process of writing lxml, an alternative Python wrapper for 
> libxml2 that among other things aims at automatic memory management, 
> I've so far worked around this issue by sharing dictionaries between all 
> trees, whether they be created through reading in XMl, or created from 
> scratch.
> 
> This works, though with the disadvantage of an endlessly growing 
> dictionary as more and more trees are read in, and potential threading 
> issues.
> 
> I recently discovered another disadvantage: if I understand it 
> correctly, XSLT processing, in xsltNewTransformContext, creates a new 
> dictionary in its context, with the style's signature as the subdictionary.


  Up to there I agree, but I don't understand "style's signature".
If you meant the "style's dictionnary", then yes.

> The style's signature is based on the style document's signature.
> 
> The style document's signature in lxml is of course the single shared 
> dictionary, but unfortunately it seems to be unavoidable (except by 
> rewriting xsltNewTransformContext itself) to use the shared dictionary 
> for the result of XSLT processing. This means that nodes from the tree 
> resulting from the XSLT processing cannot be safely moved to other trees 
> which do use the single shared dictionary.

  The reason is that libxslt makes the garantee that a compiled stylesheet
is read-only when used to make a transformation. It also avoid problems
of shared resources in multi-threaded XSLT engines.

> Would the developers be open to me suggesting changes to the XSLT 
> codebase to make this possible again? I suppose I should ask on the XSLT 

  Yes that should be doable. I'm not sure what would be the best API for this.

> list, so let's move on to the real purpose of this mail.
> 
> Exploring these issues made me conclude that it's time to at least look 
> at the alternative to sharing a single global dictionary, redicting 
> parts of trees. A redicting operation would take place whenever a node 
> is moved into a new tree. All the strings in the subtree below this node 
> will be traced to the originating document's dictionary, and the entries 
> will be copied into the target document's dictionary. Additionally, all 
> string references in the subtree will be made to point to the new 
> document's dictionary.

  Yes, it seems that at the DOM level this operation is called an import
based on some PHP/javascript examples I saw recently. I think that if we
add this then we should try to match the existing semantic of those
operation in PHP for example. The thing which need to be checked
when preparing for such an import are:
    - doc remapping
    - dictionnary remapping
    - namespace references to the original document
    - namespace remapping to the local document
    - entities reference to the original document
 I think those are the only pointers which are added to the pure tree
oriented parent/child/sibling ones.
  Looking at the import implementation of PHP5 might give us an idea
of how to implement this. Note that there are incomplete APIs
dealing either just with document pointers (xmlSetTreeDoc) or just
namespaces (xmlNewReconciliedNs and xmlReconciliateNs).

> Redicting is a potentially expensive operation, but I think that for 
> lxml's purposes, it's may be worth it to incur this cost. I also have 
> hopes that there are still ways to optimize this, so that redicting 
> doesn't have to happen in all cases -- after all, dictionary sharing 
> works and pools of documents might end up sharing a single dictionary.

  yes, agreed. A lot of heuristics can be used to avoid some of the costs
if used in a systematic fashion.

> In order to write a good redicting operation, I'd need a bit more 
> information about which information in a tree exactly can end up in a 
> dictionary. If someone would be able to give me a list of what ends up 
> in the dictionary, that would be extremely helpful.

  All markup names, all namespaces strings (prefix and namespace names)
and some text node content (so that all "formatting nodes" used to indent
share as much as possible, or very short text nodes for example "0" or "1").

  Hope it helps, and thanks !

Daniel

P.S.: I think I should be able to design a method to make importing strings
      from a given dictionnary into python strings quite faster for repeatedly
      querying the same set of strings. The principle would be to add an
      API to the dictionnary returning an index for the string (cost O(1))
      and at the python binding level have an array keeping pointers to
      the strings already converted (Py_INCREF'ed of course).

-- 
Daniel Veillard      | Red Hat Desktop team http://redhat.com/
[EMAIL PROTECTED]  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine http://rpmfind.net/
_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
[email protected]
http://mail.gnome.org/mailman/listinfo/xml

Re: [xml] redicting parts of trees

Reply via email to