Re: DOM 3 Patches?

Tinny Ng Tue, 16 Apr 2002 05:23:02 -0700

Lenny,

Yes, I am reviewing that as well.


I was stuck on the memory management and then sidetracked by many other
things in the last month, and thus didn't carry on the investigation since
my last post.  Since we can't start our DOM L3 development until we have
resolved this issue, I must dedicate myself to looking into this first in
the next couple of weeks.  I will post to the mailing list once I have a
better idea.  Thanks!

Tinny

----- Original Message -----
From: "Lenny Hoffman" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, April 15, 2002 4:01 PM
Subject: RE: DOM 3 Patches?


Hi Tinny,

How is your review of the DOM-IDOM integration going?  Remember that has an
impact on the decision to standardize on IDOM as the Xerces implementation.
Also, when you last posted that you were considering standardizing on
IDOM there was quite a bit of discussion regarding the danger of going to a
fixed memory model that I don't remember you commenting on.  There are also
open issues regarding serious memory leaks with the current IDOM that have
not been addressed; specifically that IDOM does not release any allocated
memory until the owning document is deleted, which leads to unlimited growth
when performing common operations like changing attribute values and adding
and removing elements.

I have been working on a write-up that describes my view of the Xerces DOM.
It is not complete yet, but since I haven't heard from you on your position,
I have included it below so that you and anyone else interested can comment.

------------------------------------------------------------

Xerces DOM Redesign

Background

The W3C has a recommendation for a standard DOM, but it did not provide a
recommendation for how the C++ language should bind to the DOM as it did
for Java.  Thus, C++ bindings are free to provide any type of interface they
see fit.  The first approach taken by the Xerces project was to emulate the
Java binding, which offered several benefits:

* Those familiar with the Java DOM binding would find it easy to learn and
use the C++ DOM.
* Memory management is hidden from C++ DOM users, just as it is for Java DOM
users.

The solution chosen for the memory management problem was to utilize the
handle/body pattern and use reference counting to know when a node body is
no longer needed.  A node body is no longer needed when:

1. No more handles are pointing to it.
2. It has no parent node.  In other words it is no longer part of a
document.

The document node is treated specially and is no longer needed when:

1. No more handles are pointing to it.
2. None of its owned nodes have any handles pointing to them.

Nodes not part of a document are deleted as soon as there are no handles
using them any longer, i.e. the client is done with them.  Nodes directly
and indirectly owned by a document node and that document node are deleted
as soon as no handles point to any of them.  The combination of these two
policies ensures that no reachable nodes are deleted, and that they are
deleted as soon as they become unreachable.
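
A minimal sketch of this policy may help.  It is not the Xerces source:
Body, Handle and g_liveBodies are hypothetical names, and the two rules
are folded into one simplified check (a tree is deleted once its root has
no parent and no handle points to any node in it).

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical counter of live bodies, present only so the effect is visible.
static int g_liveBodies = 0;

class Body {
public:
    Body() { ++g_liveBodies; }
    ~Body() {
        for (Body* child : children_) delete child;  // a parent owns its children
        --g_liveBodies;
    }
    Body(const Body&) = delete;
    Body& operator=(const Body&) = delete;

    void addRef() { ++refs_; }

    // Called when a handle lets go.  Walk up to the root of the tree; if no
    // node in that tree is referenced by a handle, delete the whole tree.
    void removeRef() {
        assert(refs_ > 0);
        --refs_;
        Body* root = this;
        while (root->parent_) root = root->parent_;
        if (!root->inUse()) delete root;  // may delete *this; touch nothing after
    }

    void appendChild(Body* child) {
        assert(child->parent_ == nullptr);
        child->parent_ = this;
        children_.push_back(child);
    }

    // The caller is expected to hold a handle on the removed child.
    void removeChild(Body* child) {
        children_.erase(std::remove(children_.begin(), children_.end(), child),
                        children_.end());
        child->parent_ = nullptr;
    }

private:
    bool inUse() const {
        if (refs_ > 0) return true;
        for (const Body* child : children_)
            if (child->inUse()) return true;
        return false;
    }

    int refs_ = 0;
    Body* parent_ = nullptr;
    std::vector<Body*> children_;
};

// The handle is the only thing a client ever holds.
class Handle {
public:
    explicit Handle(Body* body = nullptr) : body_(body) { if (body_) body_->addRef(); }
    Handle(const Handle& other) : body_(other.body_) { if (body_) body_->addRef(); }
    Handle& operator=(const Handle& other) {
        if (other.body_) other.body_->addRef();  // add first: safe on self-assignment
        if (body_) body_->removeRef();
        body_ = other.body_;
        return *this;
    }
    ~Handle() { if (body_) body_->removeRef(); }

    static Handle create() { return Handle(new Body); }
    void appendChild(const Handle& child) { body_->appendChild(child.body_); }
    void removeChild(const Handle& child) { body_->removeChild(child.body_); }

private:
    Body* body_;
};
```

In this sketch a child whose handle goes out of scope stays alive while it
is attached to a parent that is still referenced, and a removed child is
deleted as soon as its last handle is dropped.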

Some found the performance of the DOM to be less than they hoped for from a
C++ DOM implementation, and devised an alternative approach named IDOM.  For
the purposes of this discussion, the original approach described above will
be referred to as DOM.  It was thought that reference counting was incurring
a large performance hit, and developers of IDOM abandoned the reference
counting in favor of the following policies:

1. All nodes that are created by the document are owned by that document and
are not deleted until the document itself is deleted.
2. If the document was obtained from the IDOM_Parser, then the parser
manages the document's lifetime.
3. If the document was obtained via IDOM_DOMImplementation, then the user
is required to manage the document's lifetime, i.e. delete it when done with
it.

In addition to the new memory policy, the IDOM_Document was made into its
own heap manager for its owned nodes, which meant that upon document
deletion, many individual node deletions are avoided and instead a few
blocks are returned back to the system.
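
The idea of a document acting as its own heap manager can be sketched
roughly as below.  This is an illustration under assumptions, not the IDOM
allocator: DocumentArena, TextNode, blocksUsedFor and the block size are
hypothetical, and nodes are assumed to be trivially destructible.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical per-document arena: memory is handed out by bumping an offset
// inside large blocks, and is only given back when the arena is destroyed.
class DocumentArena {
public:
    explicit DocumentArena(std::size_t blockSize = 4096) : blockSize_(blockSize) {}

    void* allocate(std::size_t size) {
        const std::size_t align = alignof(std::max_align_t);
        size = (size + align - 1) / align * align;  // keep every node aligned
        if (blocks_.empty() || used_ + size > blockSize_) {
            assert(size <= blockSize_);  // sketch: no oversized allocations
            blocks_.emplace_back(new unsigned char[blockSize_]);
            used_ = 0;
        }
        void* p = blocks_.back().get() + used_;
        used_ += size;
        return p;
    }

    std::size_t blockCount() const { return blocks_.size(); }

private:
    std::size_t blockSize_;
    std::size_t used_ = 0;
    // Destroying the arena frees a few blocks rather than many nodes.
    std::vector<std::unique_ptr<unsigned char[]>> blocks_;
};

// Hypothetical node type that can only be created inside an arena.
struct TextNode {
    int length;
    explicit TextNode(int n) : length(n) {}

    static void* operator new(std::size_t size, DocumentArena& arena) {
        return arena.allocate(size);
    }
    // Matching placement delete, used only if the constructor throws.
    // There is deliberately no ordinary operator delete for single nodes.
    static void operator delete(void*, DocumentArena&) {}
};

// Builds nodeCount nodes in a small arena and reports how many blocks it took.
std::size_t blocksUsedFor(int nodeCount) {
    DocumentArena arena(1024);
    for (int i = 0; i < nodeCount; ++i) {
        TextNode* node = new (arena) TextNode(i);
        assert(node->length == i);
        (void)node;
    }
    return arena.blockCount();
}
```

Because nothing is ever returned to the arena before it is destroyed, the
same sketch also shows why removed nodes keep occupying memory for the life
of the document.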

More related to feel than to performance, the IDOM got rid of the
handle/body pattern and instead returns direct pointers to nodes for clients
to work with.  A similar thing was done with strings: a direct XMLCh pointer
is returned from nodes instead of a DOMString object.

Current situation:

The current situation is that both DOM and IDOM options are made available
to Xerces users, with the IDOM deemed experimental and subject to change.
This duality, while useful in the short term as an experiment, is harmful if
left around too long, as it is not clear to users which is best to use, and
to developers which is best to extend with features from DOM level 3, and so
on.

Going forward:

One approach to solving the duality is to eliminate the DOM interfaces in
favor of the IDOM interfaces.  While this seems attractive from a
performance standpoint, there are many drawbacks:

* Xerces becomes fixed to the IDOM memory model.  The IDOM returns direct
pointers to elements and strings to users, and with direct pointers there is
no way to know how long the pointer is in use.  The IDOM's solution to this
problem is to adopt a policy of keeping all elements and strings in memory
so long as the owning document is alive.  Other memory models, such as those
that cache unused nodes on disk, and/or compress them, and so on, become
impossible to implement because of the lack of knowledge of when a node is
in use and when it is not.
* Backward compatibility with DOM is lost.  The DOM interfaces have been
around for a long time as the official Xerces interface, and moving to IDOM
as the official interface will force existing DOM users to make many changes
to their application.
* Some similarity with the Java version of Xerces is lost.  This similarity
reduces the learning curve for those that move from the Java Xerces to the
C++ Xerces for performance or other reasons.
* Users are drawn into managing the IDOM memory model.  If they get a
document from the parser, then they need to keep the parser around as long
as they use the document.  If they get the document from the
IDOM_DOMImplementation interface, then they are responsible for deleting it.
If they get an IDOM_DocumentType from the IDOM_DOMImplementation interface,
then they are again responsible for deleting it.  While it is common for C++
users to be drawn into managing memory, ease of use is adversely affected
(which is why so many idioms and patterns that remove this responsibility
exist).  The relative sizes of the DOM and IDOM user guides illustrate this:
the IDOM user guide has to spend a great deal of time explaining how to
manage memory, which the DOM guide simply doesn't.
* There is currently a serious memory leak (bug 7645) which even when fixed
will mean that users are further drawn into managing the IDOM memory model.
The leak occurs because once a node has been added to the document it is
never deleted from its storage pool, even when removed.  The first part of
fixing this problem is to provide an overloaded delete operator that removes
nodes from the storage pool to balance the overloaded new operator used to
place nodes in the storage pool.  The second part is to further expand the
IDOM user guide to inform users that they must manually delete any removed
nodes that they are done with.

Another approach to solving the duality is to abandon the experimental IDOM
altogether, but this is not attractive, as we don't want to lose its
performance benefits.

Alas, we need a new approach, one that:

* Is as backwards compatible with the current DOM as possible.
* Does not dictate a particular memory model or DOM implementation.
* For best performance given general use, uses the IDOM implementation as
the default implementation.
* Retains the IDOM performance improvement.
* Does not leak memory.

DOM-IDOM Integration:

I recently submitted enhancement request 5967 (DOM-IDOM Integration), which
has attached all the changes needed for a new approach that meets these
goals.  The approach has been evolving and maturing, and this write-up aims
to collect the various scraps I have written about the changes into one
place and fill in the gaps, in the hope that doing so will encourage
adoption.

The idea behind the DOM-IDOM integration was to merge DOM's use of the
handle/body pattern with IDOM's implementation.  Because the new design aims
at supporting any number of alternative body implementations, the IDOM
implementation is not made the only implementation; rather, it is set up as
the default implementation, and other implementations can be substituted
without affecting clients of the DOM handles.

Use of the handle/body pattern is crucial to meeting our goals; with the
handle/body pattern, the specific implementation used for the body is hidden
from users, who only work with handles.  Furthermore, when a handle points
to a body it represents current use of the body, the knowledge of which
different implementations can use as they need.  For example, while the
current IDOM implementation keeps all of a document's nodes in memory (which
can be a scalability problem), an alternative implementation can retrieve
nodes from disk when needed and return them when no longer needed (solving
the scalability problem).

With the IDOM implementation used as the default implementation, a well
performing DOM is provided for those that can fit their entire documents in
memory.

The existing DOM handle classes were sufficient for use as the new handle
classes, so I kept them (this also assured meeting the goal of backwards
compatibility for users of the DOM interfaces).  The existing DOM body
classes that the handle classes used, though, were the specific DOM
implementation classes, and not abstract base classes that represent the
required interface that any implementation must meet.  This meant that the
DOM body classes were unsuitable for meeting the goal of having pluggable
implementations, and thus were unsuitable for the new design.  The IDOM, on
the other hand, did have abstract base classes for each of the node types,
which along with the goal of having the IDOM implementation be the default
implementation made the IDOM abstract base classes ideal for the body base
classes.

Assuming that the IDOM implementation was better suited for the default
implementation, I discarded the DOM implementation classes.  If later
desired, though, the DOM implementation classes could be adapted to derive
from the new body base classes (the IDOM abstract base classes) and become
an alternative implementation.  Some informal testing that I have done found
DOM to outperform IDOM in some circumstances (mainly with large documents),
so this may actually be desirable.

Handles communicate to bodies that they are using them by calling addRef on
the body upon usage start and removeRef upon usage end.  These are virtual
methods on the IDOM_Node abstract base class and can be overridden and used
by some implementations, and ignored by others.
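
As a rough illustration of how the same handle can serve implementations
that use or ignore these notifications, consider the sketch below.  Only
addRef and removeRef come from the design described here; NodeBody,
InMemoryBody, TrackedBody, NodeHandle and canBeEvicted are hypothetical
names, and the real IDOM_Node interface is far larger.

```cpp
#include <cassert>

// Stands in for the abstract body base class.  The defaults do nothing, so
// an implementation that keeps everything in memory can ignore them.
class NodeBody {
public:
    virtual ~NodeBody() {}
    virtual void addRef() {}
    virtual void removeRef() {}
};

// Hypothetical implementation that keeps every node resident and therefore
// has no use for the usage notifications.
class InMemoryBody : public NodeBody {};

// Hypothetical implementation that tracks usage, for example so that nodes
// nobody is using could be written to disk or compressed.
class TrackedBody : public NodeBody {
public:
    void addRef() override { ++inUse_; }
    void removeRef() override {
        assert(inUse_ > 0);
        --inUse_;
    }
    bool canBeEvicted() const { return inUse_ == 0; }

private:
    int inUse_ = 0;
};

// The handle reports start and end of use without knowing which
// implementation sits behind it.
class NodeHandle {
public:
    explicit NodeHandle(NodeBody& body) : body_(&body) { body_->addRef(); }
    NodeHandle(const NodeHandle& other) : body_(other.body_) { body_->addRef(); }
    NodeHandle& operator=(const NodeHandle& other) {
        other.body_->addRef();  // add first: safe on self-assignment
        body_->removeRef();
        body_ = other.body_;
        return *this;
    }
    ~NodeHandle() { body_->removeRef(); }

private:
    NodeBody* body_;
};
```

Client code written against NodeHandle would compile unchanged against
either body type, which is the property the pluggable design depends on.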

Default IDOM implementation reference counting:

The new design aims at avoiding drawing users into maintaining a specific
implementation's memory model, as is currently done with the IDOM.  To do
this the IDOM implementation must be modified to utilize reference counting.
But wait, you say, wasn't reference counting one of the performance problems
that the IDOM was designed to solve?  Well, yes and no.  Here is an excerpt
from the IDOM user manual:

The C++ IDOM implementation no longer uses reference counting for automatic
memory management. The C++ IDOM uses an independent storage allocator per
document. The storage for a DOM document is associated with the document
node object. The advantage here is that allocation would require no
synchronization in most cases (based on the same threading model that we
have now - one thread active per document, but any number of documents
running in parallel with separate threads).

The allocator does not support a delete operation at all - all allocated
memory would persist for the life of the document, and then the larger
blocks would be returned to the system without separately deleting all of
the individual nodes and strings within the document.

The performance benefit the IDOM provides is gained by utilization of a
document owned storage allocator, which does not require synchronization
like the general heap manager does.  Note that reference counting alone is
not a problem.  Compared to everything else that is done when using the DOM,
simply avoiding incrementing and decrementing reference counters will have
a negligible effect on performance.

As it turns out, reference counting is a useful component in solving one of
IDOM's biggest problems, that of leaking memory.  The problem is that the
memory for any removed nodes is not released until the document is
destroyed.  The document's storage allocator needs to be updated to allow
reclaiming memory of removed nodes (by adding an overloaded operator delete
to balance out its overloaded operator new), and reference counting can make
it easy to know when to call delete on nodes.
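
One way such a balanced operator new / operator delete pair could look is
sketched below.  It is not the patch attached to the enhancement request:
NodePool, Element, kSlotSize and slotsAfterChurn are hypothetical, the pool
uses fixed-size slots for brevity, and each slot records its owning pool so
that operator delete can find it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kSlotSize = 64;  // assumed upper bound on node size

// Hypothetical per-document pool whose slots can be reclaimed and reused.
class NodePool {
public:
    NodePool() = default;
    NodePool(const NodePool&) = delete;
    NodePool& operator=(const NodePool&) = delete;
    ~NodePool() { for (Slot* slot : all_) delete slot; }

    void* allocate(std::size_t size) {
        assert(size <= kSlotSize);
        Slot* slot;
        if (free_) {                 // reuse a slot released earlier
            slot = free_;
            free_ = slot->nextFree;
        } else {                     // otherwise grow the pool
            slot = new Slot;
            all_.push_back(slot);
        }
        slot->owner = this;
        ++live_;
        return slot->payload;
    }

    // Puts the slot holding p back on its owning pool's free list.
    static void release(void* p) {
        Slot* slot = reinterpret_cast<Slot*>(
            static_cast<unsigned char*>(p) - offsetof(Slot, payload));
        NodePool* pool = slot->owner;
        slot->nextFree = pool->free_;
        pool->free_ = slot;
        --pool->live_;
    }

    std::size_t slotCount() const { return all_.size(); }
    std::size_t liveCount() const { return live_; }

private:
    struct Slot {
        NodePool* owner;
        Slot* nextFree;
        alignas(std::max_align_t) unsigned char payload[kSlotSize];
    };

    Slot* free_ = nullptr;
    std::size_t live_ = 0;
    std::vector<Slot*> all_;
};

// Hypothetical node type allocated from a pool.
struct Element {
    int id;
    explicit Element(int i) : id(i) {}

    static void* operator new(std::size_t size, NodePool& pool) {
        return pool.allocate(size);
    }
    // Used only if the constructor throws.
    static void operator delete(void* p, NodePool&) { NodePool::release(p); }
    // Balances operator new: a plain `delete node` returns the slot.
    static void operator delete(void* p) { NodePool::release(p); }
};

// Creates and deletes a node `rounds` times; reports how many slots the pool
// ended up holding.
std::size_t slotsAfterChurn(int rounds) {
    NodePool pool;
    for (int i = 0; i < rounds; ++i) {
        Element* element = new (pool) Element(i);
        delete element;
    }
    return pool.slotCount();
}
```

In the design proposed here, the `delete` would be issued by the reference
counting logic once a removed node has no handles left, rather than by the
user.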

The policy used by the original DOM will suffice for the new design:

A node body is no longer needed when:

1. No more handles are pointing to it.
2. It has no parent node.  In other words it is no longer part of a
document.

The document node is treated specially and is no longer needed when:

1. No more handles are pointing to it.
2. None of its owned nodes have any handles pointing to them.

IDOM implementation changes:

1. Add an overloaded operator delete to balance the overloaded operator new
provided by IDDocumentImp.
2. Add reference counters to IDNodeImp and IDDocumentImp.

-----Original Message-----
From: Tinny Ng [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 15, 2002 2:48 PM
To: [EMAIL PROTECTED]
Subject: Re: DOM 3 Patches?


Jason.

> Are you willing to accept patches for DOM level 3 implementations
Yes, but one second ...

Remember that some time ago I posted an Apache C++ DOM Binding proposal?  I
am now reviewing the comments and trying to reorganize the IDOM (e.g. rename
IDOM_DOMXXX to DOMXXX as discussed; this also matches the DOM L3 naming
convention (e.g. DOMBuilder, DOMErrorHandler, etc.)).

Give me a few more days, and I will post the prototype in the mailing list
for review.  Then you can submit your patch based on this new prototype.

Tinny


----- Original Message -----
From: "Jason E. Stewart" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, April 15, 2002 1:30 PM
Subject: DOM 3 Patches?


> Hey Tinny et. al.,
>
> Are you willing to accept patches for DOM level 3 implementations in
> IDOM? I'd really like to add support for the new 'encoding',
> 'version', and 'standalone' attributes of DOM_Document. That way I can
> handle XML declarations properly.
>
> What say you?
>
> Thanks,
> jas.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

