Hi Tinny,
- Lenny suggests using the IDOM interface internally in the
DOM implementation (patch in Bugzilla 5967)
- Then the performance benefits of IDOM are gained, but the memory
retention problem in the IDOM implementation still remains to be addressed.
Actually, I have now
addressed the memory retention problem. I am running some tests before
adding the changes to patch 5967. The solution is to recycle memory that is
allocated but no longer in use, which is facilitated by reference counting.
Here is a description of my approach.
The IDDocument storage allocator is only designed to hand
out memory in chunks, not release it in chunks, and updating it to support
deletes would greatly complicate its design, as well as diminish its
performance. This means we need an
alternative to deleting nodes once they are no longer needed.
An alternative to deleting is recycling. Node recycling can be accomplished
by having IDDocument keep a free list for each type of node. Nodes that are no
longer needed are added to the free list for their node type, and the document
takes a node from that free list, when one is available, before allocating a
new node. By overloading
the owning node pointer in free nodes to mean the next free node, no additional
data must be added to nodes to support free lists.
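As a rough illustration of the free-list idea (this is not the code in the patch; Node, Document, createNode and recycle are names made up for this sketch, and the real allocator is replaced by a plain vector):

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical, simplified node. 'owner' normally points at the owning node;
// while the node sits on a free list it is reused as the "next free" link.
struct Node {
    int   type;   // index of the node type (element, text, ...)
    Node* owner;  // owning node, or next free node when recycled
};

class Document {
public:
    static const int kNodeTypes = 2;  // assumption: two node types for the sketch

    Document() {
        for (int i = 0; i < kNodeTypes; ++i) freeList_[i] = nullptr;
    }

    // Take a node from the free list for this type if one is available,
    // otherwise get fresh storage from the document.
    Node* createNode(int type) {
        Node* n = freeList_[type];
        if (n != nullptr) {
            freeList_[type] = n->owner;  // pop: 'owner' holds the next free node
        } else {
            storage_.push_back(std::unique_ptr<Node>(new Node()));
            n = storage_.back().get();
        }
        n->type = type;
        n->owner = nullptr;
        return n;
    }

    // Put a node that is no longer needed on the free list for its type.
    void recycle(Node* n) {
        n->owner = freeList_[n->type];   // push: overload 'owner' as the link
        freeList_[n->type] = n;
    }

    std::size_t allocated() const { return storage_.size(); }

private:
    Node* freeList_[kNodeTypes];
    std::vector<std::unique_ptr<Node>> storage_;  // stand-in for the document's allocator
};
```

Creating a node, recycling it, and creating another node of the same type returns the same storage, so the document does not grow.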
IDDocument's storage allocator not only allocates storage
for nodes, but also strings. Most
of a document's strings are organized into a string pool, which uses a hash map
to keep just one copy of any particular string value.
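For readers unfamiliar with the idea, a string pool can be sketched roughly as follows. This is only an illustration, not the IDOM code: StringPool and intern are hypothetical names, and char16_t stands in for XMLCh.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical string pool: a hash container that keeps one copy of each
// distinct string value and hands out a pointer to that single copy.
class StringPool {
public:
    // Returns a pointer to the pooled copy of 's', adding it if needed.
    // Pointers stay valid because unordered_set never moves its elements.
    const char16_t* intern(const std::u16string& s) {
        return pool_.insert(s).first->c_str();
    }

    std::size_t size() const { return pool_.size(); }

private:
    std::unordered_set<std::u16string> pool_;
};
```

Interning the same name twice yields the same pointer, which is why a pool works well for a small set of repeating element and attribute names.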
It is interesting to note that the original DOM also had a string
pool, one that its documentation states is for the purpose of recycling element
and attribute names. However, it only uses the pool for non-namespace elements;
namespace elements and all attributes get their own name strings.
In the IDOM, IDElementImpl, IDElementNSImpl, IDAttrImp, and
IDAttrNSImp do use the string pool for their names, but so does
IDCharacterDataImp for its value. A
string pool is extremely useful for storing a limited number of repeating string
values like element and attribute names, but is less useful for storing the
value of text nodes, which will usually vary more in value than element and
attribute names do. It is
understandable that, without any means of releasing the memory for a string, the
string pool was used for text node values: a typical document will have some
repeating text node values, like "true" and "false" for Boolean attributes, and
some memory will be reused. But
this is not a complete solution, because many documents are likely to
have a large number of unique text node values, which means that much memory
will not get reused and that the performance of the string pool will
diminish as its size increases.
To solve these problems, while keeping the string pool for element
and attribute names, I added a string heap for arbitrary strings. The string heap gets its memory from the
document's storage allocator, but supports returning strings to the
heap. I changed IDCharacterDataImp
to get its string storage from the string heap instead of from the
string pool. Also, when its value changes or the node is recycled, I
have it return its current storage to the string heap and get any new storage
it requires from the string heap.
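To give an idea of what I mean by a string heap, here is a minimal sketch. It is not the code in the patch: StringHeap, allocate and release are hypothetical names, the size-class rounding is an assumption of the sketch, and a vector stands in for the document's storage allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <vector>

// Hypothetical string heap: hands out character buffers and accepts them
// back, so that storage for changed or discarded values can be reused.
class StringHeap {
public:
    // Returns a buffer able to hold 'len' characters plus a terminator.
    char16_t* allocate(std::size_t len) {
        std::size_t cap = roundUp(len + 1);
        std::vector<char16_t*>& bucket = free_[cap];
        if (!bucket.empty()) {           // reuse a returned buffer of this size class
            char16_t* p = bucket.back();
            bucket.pop_back();
            return p;
        }
        storage_.push_back(std::unique_ptr<char16_t[]>(new char16_t[cap]));
        return storage_.back().get();
    }

    // Returns a buffer to the heap; 'len' must be the length it was allocated for.
    void release(char16_t* p, std::size_t len) {
        free_[roundUp(len + 1)].push_back(p);
    }

    std::size_t blocks() const { return storage_.size(); }

private:
    // Assumption: buffers are grouped into size classes of 16 characters.
    static std::size_t roundUp(std::size_t n) { return (n + 15) / 16 * 16; }

    std::map<std::size_t, std::vector<char16_t*>> free_;  // free buffers by capacity
    std::vector<std::unique_ptr<char16_t[]>> storage_;    // stand-in for the document allocator
};
```

With this arrangement, a text node that changes its value hands back its old buffer, and a later request in the same size class picks that buffer up again instead of growing the document.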
Lenny
Hi everyone,
I've reviewed Andy's design objectives for IDOM,
Lenny's view of the old DOM and his proposal for a redesign, and some users'
feedback. Here is a "quick" summary, and I would like to call
for a VOTE about the fate of these two interfaces.
1.0 Objective
==========
1. Define the strategy for the Xerces-C++ public DOM
interface. Decide which one to keep: the old
DOM interface or the new IDOM interface
2.0 Motivation
===========
1. As a long term strategy, Xerces-C++ shouldn't define
two W3C DOM interfaces, which simply confuses users.
=> We've already had many questions from users
about what the difference is, which one to use, etc.
2. With limited resources, we should focus our
development on ONE stream, with no more duplicated effort
=> New DOM Level 3 development
should be done on one interface, not both.
=> No more dual maintenance: two
sets of samples (e.g. DOMPrint vs IDOMPrint), two parsers (DOMParser vs
IDOMParser)
=> To encourage more users to
develop DOM applications AND implementations based on this binding.
=> Such a binding should just define
a set of abstract base classes (similar to Java interfaces) where no
implementation model is assumed
3.0 History
=========
'DOM' was the initial "W3C DOM interface" developed
by Xerces-C++. However, the performance of its implementation is not
quite satisfactory.
Last year, Andy Heninger came up with a new
design with faster performance, and that implementation came with a new
set of interfaces => 'IDOM'.
Currently both 'DOM' and 'IDOM' are shipped with
Xerces-C++. 'IDOM' is labelled experimental (like a prototype)
and is subject to change.
More information can be found at: http://xml.apache.org/xerces-c/program.html
4.0 IDOM
=========
4.1 Interface
==========
4.1.1 Features of IDOM Interface
--------------------------------------------------
e.g. virtual IDOM_Element* IDOM_Document::createElement(const XMLCh* tagName) = 0;
1. Defined as abstract base classes
2. Uses normal C++ pointers.
=> So that abstract base classes are possible.
=> Makes it more C++-like and less Java-like.
4.1.2 Pros and Cons of IDOM Interface
----------------------------------------------------------
Pros:
1. Abstract base classes that correspond to the W3C DOM
interfaces
=> Can be recommended
as the Apache DOM C++ Binding
=> More standard-like: no
implementation is assumed, as they are just abstract interfaces using pure virtual
functions
2. (Depends on users' preference)
- some users prefer a C++-like
style
Cons:
1. IDOM_XXX - weird prefix
'I'
Solution:
- Proposed to rename to DOMXXXX, which also matches the DOM Level 3
naming convention
2. (Depends on users' preference)
- some users do not like pointers, and
want a Java-like interface for ease of use, ease of learning and ease of porting
(from Java).
3. As the old DOM interface has been around for a long
time, the majority of current Xerces-C++ users still use the old DOM interface,
so the migration impact is significant
Solution:
- Announce the
deprecation of the old DOM interface for a couple of releases before
removal
4.2 Implementation
===============
4.2.1 Features of IDOM Implementation
-----------------------------------------------------------
1. Uses an independent
storage allocator per document. The advantage here is that allocation
requires no synchronization
=> Fast, good scalability, reduced
memory footprint
2. Uses plain, null-terminated (XMLCh *) UTF-16 strings.
=> No DOMString
class overhead, which is another contributor to
IDOM's faster performance
4.2.2 Downside of IDOM Implementation
-------------------------------------------------------------
1. Manual memory management
- If the document comes from the parser, then the
parser owns the document. If the document comes from DOMImplementation, then
users are responsible for deleting it.
Solution:
- Provide a means
of disassociating a document from the parser
- Add a function
"Node::release()", similar to the idea of "Range::detach", which allows
users to indicate that they are done with the Node.
- From the C++ Binding abstract interface perspective, it's up to the implementation
how to handle this "release()" function.
- With the Xerces-C++ IDOM implementation, the release()
function will delete the 'this' pointer if it is a document; otherwise it is a
no-op.
2. Memory is retained until the document is
deleted.
- If you change the value of an
attribute or call removeNode many times, the memory of the old
value is not deallocated for reuse, and the document grows and
grows
Solution:
- This is in fact
a tradeoff for the fast performance offered by the independent storage
allocator.
- There is no good
solution in place at the moment
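A minimal sketch of the "release()" proposal in item 1 above, to make the intended behaviour concrete. The class names (Node, DocumentImpl, ElementImpl) and the liveDocuments counter are hypothetical and exist only for this illustration; they are not the actual Xerces-C++ classes.

```cpp
#include <cassert>

// Hypothetical abstract binding: how release() behaves is left to the implementation.
struct Node {
    virtual ~Node() {}
    virtual void release() = 0;
};

// In this sketch a document deletes itself when released.
struct DocumentImpl : Node {
    static int liveDocuments;            // counter used only to observe the behaviour
    DocumentImpl()  { ++liveDocuments; }
    ~DocumentImpl() { --liveDocuments; }
    void release()  { delete this; }     // the document must have been created with new
};
int DocumentImpl::liveDocuments = 0;

// Any other node treats release() as a no-op; its storage belongs to the document.
struct ElementImpl : Node {
    void release() {}
};
```

Calling release() on an element changes nothing, while calling it on a document created with new destroys the document. The pointer must not be used after that call.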
5.0 old DOM
==========
5.1 Interface
==========
5.1.1 Features of old DOM Interface
-----------------------------------------------------
e.g. DOM_Element DOM_Document::createElement(const DOMString tagName);
1. Uses smart pointers - Java-like
5.1.2 Pros and Cons of old DOM Interface
--------------------------------------------------------------
Pros:
1. DOM_XXX - reasonable name
2. (Depends on users' preference)
- some users want a Java-like
interface for ease of use, ease of learning and ease of porting (from
Java).
3. Not that many users have migrated to IDOM yet, so the migration
impact is minimal.
Cons:
1. Not abstract base classes
- Cannot be recommended as the Apache DOM C++
Binding
- An implementation (smart pointer indirection) is
assumed
Solution:
- This is in
fact a tradeoff for the ease of use of the smart pointer
design
- No
solution.
2. (Depends on users' preference)
- some users want a C++-like style, as this is
a C++ interface
5.2 Implementation
===============
5.2.1 Features of old DOM Implementation
----------------------------------------------------------------
1. Automatic memory
management
- Memory is released
when there are no more handles pointing to it
- Uses a reference count to keep track
of handles
2. Uses a thread-safe DOMString class
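The handle-plus-reference-count scheme in item 1 can be sketched roughly as below. This is only an illustration of the technique, not the old DOM code: NodeImpl and NodeHandle are hypothetical names, the 'live' counter is there only to observe the behaviour, and the counting here is not thread-safe.

```cpp
#include <cassert>

// Hypothetical implementation object that carries its own reference count.
struct NodeImpl {
    int refCount;
    static int live;                     // counter used only to observe the behaviour
    NodeImpl() : refCount(0) { ++live; }
    ~NodeImpl() { --live; }
};
int NodeImpl::live = 0;

// Hypothetical handle, passed by value like a Java reference. Copying a
// handle adds a reference; destroying the last handle frees the node.
class NodeHandle {
public:
    explicit NodeHandle(NodeImpl* impl = nullptr) : impl_(impl) { addRef(); }
    NodeHandle(const NodeHandle& other) : impl_(other.impl_) { addRef(); }
    NodeHandle& operator=(const NodeHandle& other) {
        if (impl_ != other.impl_) {
            removeRef();
            impl_ = other.impl_;
            addRef();
        }
        return *this;
    }
    ~NodeHandle() { removeRef(); }

private:
    void addRef()    { if (impl_) ++impl_->refCount; }
    void removeRef() { if (impl_ && --impl_->refCount == 0) delete impl_; }
    NodeImpl* impl_;
};
```

The user never calls delete: the node goes away when the last handle to it goes out of scope. The cost is the extra indirection and bookkeeping on every copy.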
5.2.2 Downside of old DOM Implementation
--------------------------------------------------------------------
1. Performance is slow
- Memory management is the biggest
time consumer, and the memory footprint is large.
- There are a whole lot of blocks
allocated when creating a document and then freed when finished with it. Each
and every node requires at least one and sometimes several separately
allocated blocks. A DOMString takes three. It adds up.
Solution:
- Lenny suggests
using the IDOM interface internally in the DOM implementation (patch in Bugzilla 5967)
- Then the
performance benefits of IDOM are gained, but the memory retention problem in the IDOM
implementation still remains to be addressed.
- And internally,
we will have a dual interface maintenance model, as the IDOM interface
is then used by DOM internally.
Vote Question:
============
I would like to call for a vote:
==> Which INTERFACE should
be the Xerces-C++ public supported W3C DOM Interface, DOM or IDOM?
<===
Note:
1. The question is asking which "interface" is to be officially
supported. Once the interface is chosen, we can discuss how to
solve the downsides of its implementation as the next topic.
2. The one voted for will become the ONLY Xerces-C++
supported public W3C DOM Interface, and is where DOM Level 3 will be
implemented.
3. The API of the other interface will be
deprecated, and its samples and associated parser
will eventually be removed from the distribution