I don't know enough about the implementations of DOMString or the IDOM tree model,
so I will phrase my comments partly as questions.
Is the internal implementation also going to use this lightweight
string class to maintain its strings? If not, how would we guarantee that
nobody is going to change the memory that this DOMString is pointing to? If we
cannot guarantee this, then the semantics of this 'DOMString' may be
confusing to users.
Also, what about 'read-only' thread safety? I know this is not necessarily a design
goal here, but it sure would be nice to have. Has any thought
been given to this? Or is the current thinking that each thread must be looking
at its own copy of the DOM?
Again, from what I have understood of Lenny's handle/body implementation, we do not have
'read-only' thread safety. Consider the case where two threads are pointing to
the same element from a given document tree. As both of them destruct their
handle objects, both handles will attempt to decrement the reference count
and we could end up with a race condition here. This MAY result
in unpredictable behavior, unless we implement the reference counting in a
thread-safe manner. This is especially true on SMP machines. I may have this all
wrong and this may be thread-safe, in which case I apologize.
Simply by implementing thread-safe increment/decrement, we can guarantee that as long
as NO CHANGES are made to the document itself, multiple threads can be reading
various parts of it. This is because on each object there will be at least one
handle holding a reference count (i.e. the main document itself), hence no
objects will need to be added to the allocator's free list. Note that if we end
up deleting something, and this has to be added to the document's free list, then
the allocator needs to be thread-safe. Note also that making the per-document
allocator thread-safe may not be too bad, as there will rarely be contention for
this allocator. We can use a non-yielding spin latch (STLport does one), which
would mean very little overhead for having a truly thread-safe (read-only) DOM
model. Note that in many cases the spin latch is highly optimized by relying on
OS-specific interfaces for atomic operations (Win32 InterlockedXXXX functions),
or in some cases hand-optimized assembly to implement atomic operations
(STLport does this on Solaris).
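To make the increment/decrement point concrete, here is a minimal sketch of a handle whose reference count is atomic. It is not the actual Xerces-C++ handle/body code: Body, Handle and stress_read_only are hypothetical names, and std::atomic (C++11) stands in for the Win32 Interlocked functions or the STLport latch mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical body: stands in for a node owned by a document.
struct Body {
    std::atomic<int> refs{0};  // atomic, so concurrent handles cannot lose updates
};

// Hypothetical handle: copying and destroying it only touches the counter.
class Handle {
public:
    explicit Handle(Body* b) : body_(b) { acquire(); }
    Handle(const Handle& other) : body_(other.body_) { acquire(); }
    Handle& operator=(const Handle&) = delete;  // kept minimal for the sketch
    ~Handle() { body_->refs.fetch_sub(1, std::memory_order_acq_rel); }

private:
    void acquire() { body_->refs.fetch_add(1, std::memory_order_relaxed); }
    Body* body_;
};

// Spawns `threads` readers that each create and destroy `copies` handles,
// then returns the count that remains while the document's reference is alive.
int stress_read_only(int threads, int copies) {
    assert(threads >= 0 && copies >= 0);
    Body body;
    Handle documentRef(&body);  // the document itself keeps the count >= 1
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&documentRef, copies] {
            for (int i = 0; i < copies; ++i) {
                Handle local(documentRef);  // increment, then decrement at scope exit
            }
        });
    }
    for (auto& th : pool) th.join();
    return body.refs.load();
}
```

Because the document's own handle never goes away during the run, the count never reaches zero, so nothing is handed to the allocator's free list. With a plain int counter the same loop would be a data race.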
This is one thing that the existing IDOM has going for it, i.e. it is
'read-only' thread safe.
More of my two bits...
Samar Lotia
I just had a new thought: if having a DOMString class is desired, for functionality
and/or DOM compliance, then the smart pointer approach can still be used by
updating the IDOM classes to return DOMString instances instead of
XMLCh*. Using smart pointers, we would still only have one set
of interfaces to maintain, and performance would be negligibly affected because, as I
pointed out earlier, I modified DOMString to simply wrap an alias to the
node-owned XMLCh* data and to make a copy only if modified.
Lenny
Hi Samar,
You make good points.
I would agree that it is reasonable to nix the DOMString, but does anyone object to
that given that DOMString is explicitly specified in the W3C DOM
specification? Judging so far from the early responders to the vote,
no, as folks voting for the IDOM interface are also voting to nix the
DOMString class.
(Tinny), do you anticipate that the W3C will complain if the C++ binding does not have
a DOMString? In other words, will we be able to call ourselves
DOMx compliant without it?
One more consequence of using the smart pointer approach is that backwards
compatibility with the original DOM interfaces is sacrificed for backwards
compatibility with the IDOM interfaces. I thought that, with the
original DOM interfaces being officially supported and around longer,
backwards compatibility with them would be more important, but so far no one
using the original DOM interface has spoken up. For my use
cases it simply doesn't matter; what matters most to me is functional
behavior and ease of use.
Just to make it easier to review, here is the earlier example following your
suggestion to avoid using an int operator on node for null
comparison:
if (!pm_Element.isNull())
    pm_Element->getAttribute(...);
Lenny
If the desire is to maintain only one interface, then I would be of the opinion
that we should nix the DOMString class and use a 'smart pointer' class to
wrap the internal interfaces. In many cases, people will likely have
their own preferred string class which they use and will immediately
convert the value extracted from the DOM before passing it into any other
layer of their code.
If we keep DOMString around, I would recommend against having a (const XMLCh *)
operator, as this can result in some incredibly hard-to-track errors. Most
C++ style guides recommend against implicit conversion operators. Note the
lack of such an operator in the C++ standard library string, i.e.
std::basic_string<T>. Having something like rawBuffer, or XMLCh(),
would be clearer and lets one control lifetimes in a much clearer way
(IMHO).
Also, I would recommend against adding an int operator on the smart pointer class.
It is not that much work to call isNull on the object, and it is much clearer
from a readability perspective and helps catch silly errors at
compile time. If we must have such an operator, then it may be better
to add a bool operator instead of int, as this will likely reduce the
number of places where the implicit conversion operator will be
called.
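As an illustration of the difference, here is a small sketch of a handle that offers isNull() plus an explicit bool conversion. ElementImpl, ElementPtr and countAttributes are hypothetical names, not Xerces-C++ classes, and `explicit operator bool` is a C++11 feature that was not available when this thread was written; it is shown only as one way to get `if (ptr)` without the silent conversions an int operator allows.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical body class standing in for an element implementation.
struct ElementImpl {
    int attributeCount() const { return 3; }
};

// Hypothetical smart-pointer handle.
class ElementPtr {
public:
    explicit ElementPtr(ElementImpl* impl = nullptr) : impl_(impl) {}
    bool isNull() const { return impl_ == nullptr; }
    // Usable in `if (ptr)`, but not in arithmetic or implicit conversions.
    explicit operator bool() const { return impl_ != nullptr; }
    ElementImpl* operator->() const { assert(impl_ != nullptr); return impl_; }

private:
    ElementImpl* impl_;
};

// The handle does not silently turn into a number or a bool...
static_assert(!std::is_convertible<ElementPtr, int>::value, "no implicit int");
static_assert(!std::is_convertible<ElementPtr, bool>::value, "no implicit bool");
// ...but it can still be tested directly in a condition.
static_assert(std::is_constructible<bool, ElementPtr>::value, "contextual bool ok");

int countAttributes(const ElementPtr& e) {
    if (e) return e->attributeCount();
    return 0;
}
```

With an int operator instead, an expression such as `handle + 1` or passing the handle to a function taking an int would compile without complaint, which is the kind of silly error mentioned above.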
My two bits...
Samar Lotia
Hi Markus,
Thank you very much for the insight.
Note that simply accessing the IDOM implementation via handles does not
affect its thread safety, thus your application is safe.
> if (pm_Element)
>     pm_Element->getAttribute(...);
> How can I do this with references?
You do it with the current handles like this:
if (!pm_Element.isNull())
    pm_Element.getAttribute(...);
Adding an int operator to DOM_Node would allow
even friendlier syntax, e.g.:
if (pm_Element)
    pm_Element.getAttribute(...);
This could be easily added.
In fact, an -> operator could be added to the DOM_Node classes to get
this:
if (pm_Element)
    pm_Element->getAttribute(...);
This is now exactly what you started out with, and thus is completely backward
compatible with your current use of the IDOM.
> XMLCh* are easier to handle than DOMString objects in ATL:
>     CComBSTR cBstr = pm_Element->getAttribute(...);
Good point. The current DOMString class does not have an XMLCh* operator,
which, if it did, would solve your problem. I pretty much
gutted the original DOMString class to make it a simple wrapper around
an XMLCh* returned from IDOM implementations, in lieu of suffering the
costs of the cross-document string management of the original
DOM. As far as I can tell, the only reason the original DOMString
did not have an XMLCh* operator was that there was no guarantee that
its internal XMLCh* was null terminated; that guarantee does now
exist and the operator can be added. I will do that. So your
example remains:
CComBSTR cBstr = pm_Element->getAttribute(...);
Note that string classes are a convenient way to perform various
operations on a string without using the static (read: functional)
methods provided by XMLString. I even implemented COW (copy
on write) behavior in the new DOMString class, so that you can feel free
to modify a string returned from a node without having to manually make
a copy.
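For readers who have not seen the patch, the alias-then-copy behavior described here can be sketched roughly as follows. This is not the real DOMString: AliasString, rawBuffer, ownsCopy and append are hypothetical names, and XMLCh is assumed to be a 16-bit code unit (char16_t here).

```cpp
#include <cassert>
#include <string>

using XMLCh = char16_t;  // assumption: a 16-bit UTF-16 code unit

// Hypothetical wrapper: aliases the node-owned buffer until first modified.
class AliasString {
public:
    explicit AliasString(const XMLCh* nodeOwned) : alias_(nodeOwned) {
        assert(nodeOwned != nullptr);
    }

    // Read access never copies while the string is unmodified.
    const XMLCh* rawBuffer() const { return owned_ ? copy_.c_str() : alias_; }
    bool ownsCopy() const { return owned_; }

    // The first mutation copies the aliased data; the node's buffer is untouched.
    void append(const XMLCh* tail) {
        assert(tail != nullptr);
        if (!owned_) {
            copy_.assign(alias_);
            owned_ = true;
        }
        copy_.append(tail);
    }

private:
    const XMLCh* alias_;    // points into memory owned by the node
    std::u16string copy_;   // private copy, filled on first write
    bool owned_ = false;
};
```

Until the first write, the wrapper is only as valid as the node's buffer it points into, so the sketch assumes that buffer stays alive and unchanged while aliases exist.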
If folks don't find the DOMString wrapper to be that important, that frees
me up to simplify the handle classes and address one of Tinny's
concerns. Tinny pointed out that while the new design hides dual
interfaces (DOM and IDOM) from users, it does not hide them from
DOM developers; as DOM 3 support is added, each interface change
would have to be made to both DOM and IDOM classes. The only
reason I went with complete interface replication instead of simple
smart pointers for the handle classes was to be able to translate XMLCh
pointers returned from IDOM nodes into DOMStrings. If I am allowed
to get rid of DOMString altogether I can make the handle classes simple
smart pointers that do not replicate IDOM interfaces, and thus the
duplication of effort is eliminated.
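A rough sketch of what "simple smart pointers that do not replicate IDOM interfaces" could look like: the handle forwards everything through operator->, so a method added to the abstract interface needs no matching change in the handle. INode, ElementNode and NodeHandle are hypothetical names, and reference counting is left out to keep the sketch short.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an IDOM-style abstract interface.
struct INode {
    virtual ~INode() = default;
    virtual std::string nodeName() const = 0;
};

struct ElementNode : INode {
    std::string nodeName() const override { return "element"; }
};

// Generic handle: no per-method wrappers, only forwarding.
template <class T>
class NodeHandle {
public:
    explicit NodeHandle(T* body = nullptr) : body_(body) {}
    bool isNull() const { return body_ == nullptr; }
    T* operator->() const { assert(body_ != nullptr); return body_; }

private:
    T* body_;  // non-owning in this sketch; a real handle would count references
};
```

The cost is the one noted above: whatever the interface returns (an XMLCh* in the IDOM case) is what the caller gets, since the handle has no place to translate it into a DOMString.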
Lenny
-----Original Message-----
From: Markus Fellner [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 6:17 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: AW: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface
OK, the main reason for my IDOM flirtation is...
I've chosen IDOM because of its thread safety. And
now I have several thousand lines of code using
the IDOM interface.
Some other reasons are...
I have many IDOM_Element* members (pm_Elem) in my
classes. After parsing, they are assigned once and then
checked many times to see whether they are really assigned, and used for reading
and writing attributes.
if (pm_Element)
    pm_Element->getAttribute(...);
How can I do this with references?
XMLCh* are easier to handle than DOMString objects in ATL:
    CComBSTR cBstr = pm_Element->getAttribute(...);
...
Sorry for my short answer. I go on
holiday tomorrow and I have to pack up!
I'm back in 2 weeks and looking forward to seeing the results of this
voting.
It's a pity to go during a hot discussion on this list.
Markus
Hi Markus,
To be clear, the fix I created for the IDOM
was to recycle memory once a node or string is no longer
needed. To know when a node is no longer needed I used
the original DOM interface classes, but have them wrap the IDOM as
the implementation. IDOM performance is maintained, but ease
of use is greatly improved. Without using the DOM handles to
know when an IDOM node is in use or not, application code will be
drawn into explicitly stating when a node is no longer needed and
can be recycled, which is yet another thing to be documented and
for application developers to get wrong and suffer the consequences of.
If you love and use the IDOM for its
performance, and you want the memory problem
really fixed, not worked around in a way that only works if your application
does everything right, then you will love what I have done by
combining DOM classes as handles and IDOM classes as bodies.
If what you love is working with pointers
instead of with objects, please let me know why.
One thing I have found harder with objects
vs. pointers is downcasting from node to derived objects like
element. The syntax is a bit cleaner with pointers,
e.g.:
DOM_Node node = ...
DOM_Element elem = (const DOM_Element&)node;
vs:
IDOM_Node* node = ...
IDOM_Element* elem = (IDOM_Element*)node;
It is easy to forget to add the const in
the first case, and it is somewhat non-intuitive because slicing can
happen, though it is not a problem in this case.
To solve this problem I have thought of
adding overloaded constructors and assignment operators that take a
DOM_Node to DOM_Node-derived classes like DOM_Element. Thus
the first example becomes:
DOM_Node node = ...
DOM_Element elem = node;
Not only is this code more succinct, but it
is safer, as the overloaded constructor and assignment operator can
check for node compatibility via the getNodeType call.
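The checked conversion could look something like the sketch below. Node, Element and NodeImpl are hypothetical stand-ins for DOM_Node, DOM_Element and their bodies, and throwing on a type mismatch is an assumption made for the sketch; the proposal above does not say what should happen on failure (yielding a null handle would be another option). The node type values follow the W3C DOM constants.

```cpp
#include <cassert>
#include <stdexcept>

enum NodeType { ELEMENT_NODE = 1, TEXT_NODE = 3 };  // W3C DOM nodeType values

struct NodeImpl { NodeType type; };  // hypothetical body

class Node {
public:
    explicit Node(NodeImpl* impl = nullptr) : impl_(impl) {}
    bool isNull() const { return impl_ == nullptr; }
    NodeType getNodeType() const { assert(impl_ != nullptr); return impl_->type; }

protected:
    NodeImpl* impl_;
};

class Element : public Node {
public:
    Element() = default;
    // Converting constructor: accepts a Node only if it really is an element.
    // Throwing is this sketch's choice of failure mode, not part of the proposal.
    Element(const Node& n) : Node(n) {
        if (!isNull() && getNodeType() != ELEMENT_NODE)
            throw std::invalid_argument("node is not an element");
    }
};
```

Compared with the `(const DOM_Element&)node` cast, the conversion is written as a plain initialization and a wrong node type is caught at the point of conversion rather than later.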
Again, please let me know what other
aspects of pointers make things easier for you.
> Hope your fix has no effects on thread-safe-ness!
No effect whatsoever.
Lenny
Hi Lenny,
I hope your fix of the IDOM memory
problem goes into the next official release. But I use and love
the IDOM interface.
It's really easier for an old C++
programmer like me! And I use IDOM because of its thread-safe
properties. Hope your fix has no effects on thread-safe-ness!
Markus
Hi Markus,
The memory management problem is solved by recycling
nodes and strings that are no longer used. The only clean way I
know of to tell when nodes and strings are being used is to use the
handle/body pattern, which is what is used by the original
DOM. What I have done is use the original DOM handles and
the IDOM implementation, but fixed the IDOM memory problem.
Lenny
If the memory management problem is solved, I
prefer IDOM!!!
Hi everyone,
I've reviewed Andy's design
objective of IDOM, Lenny's view of old DOM and his proposal
of redesign, and some users' feedback.
Here is a "quick" summary and I would like to call for
a VOTE about the fate of these two interfaces.
1.0 Objective
==========
1. Define the strategy of the Xerces-C++ public DOM interface. Decide which
   one to keep: the old DOM interface or the new IDOM interface.
2.0 Motivation
===========
1. As a long-term strategy, Xerces-C++ shouldn't define two W3C DOM
   interfaces, which simply confuses users.
   => We've already got many users' questions about what the difference
      is, which one to use, etc.
2. With limited resources, we should focus our development on ONE stream,
   with no more duplicate effort.
   => New DOM Level 3 development should be done on one interface, not
      both.
   => No more dual maintenance: two sets of samples (e.g. DOMPrint vs
      IDOMPrint), two parsers (DOMParser vs IDOMParser)
   => To encourage more users to develop DOM applications AND
      implementations based on this binding.
   => Such a binding should just define a set of abstract base classes
      (similar to a Java interface) where no implementation model is
      assumed
3.0 History
=========
'DOM' was the initial "W3C DOM interface" developed by Xerces-C++. However,
the performance of its implementation is not quite satisfactory.
Last year, Andy Heninger came up with a new design with faster performance,
and that implementation came with a new set of interfaces => 'IDOM'.
Currently both 'DOM' and 'IDOM' are shipped with Xerces-C++. 'IDOM' is
labeled experimental (like a prototype) and is subject to change.
More information can be found at: http://xml.apache.org/xerces-c/program.html
4.0 IDOM
=========
4.1 Interface
==========
4.1.1 Features of IDOM Interface
--------------------------------------------------
e.g. virtual IDOM_Element* IDOM_Document::createElement(const XMLCh* tagName) = 0;
1. Defined as abstract base classes
2. Use normal C++ pointers.
   => So that an abstract base class is possible.
   => Makes it more C++-like, less Java-like.
4.1.2 Pros and Cons of IDOM Interface
----------------------------------------------------------
Pros:
1. Abstract base classes that correspond to the W3C DOM interfaces
   => Can be recommended as the Apache DOM C++ Binding
   => More standard-like; no implementation is assumed, as they are just
      abstract interfaces using pure virtual functions
2. (Depends on users' preference)
   - some prefer a C++-like style
Cons:
1. IDOM_XXX - weird prefix 'I'
   Solution:
   - Proposed to rename to DOMXXXX, which also matches the DOM Level 3
     naming convention
2. (Depends on users' preference)
   - some do not like pointers, and want a Java-like interface for ease of
     use, ease of learning and ease of porting (from Java).
3. As the old DOM interface has been around for a long time, the majority
   of current Xerces-C++ users still use the old DOM interface, so there
   is a significant migration impact
   Solution:
   - Announce the deprecation of the old DOM interface a couple of
     releases before removal
4.2 Implementation
===============
4.2.1 Features of IDOM Implementation
-----------------------------------------------------------
1. Use an independent storage allocator per document. The advantage
   here is that allocation requires no synchronization
   => Fast, good scalability, reduced memory footprint
2. Use plain, null-terminated (XMLCh *) UTF-16 strings.
   => No DOMString class overhead, which is another contributor to
      IDOM's faster performance
4.2.2 Downside of IDOM Implementation
-------------------------------------------------------------
1. Manual memory management
   - If the document comes from the parser, then the parser owns the
     document. If the document comes from DOMImplementation, then users
     are responsible for deleting it.
   Solution:
   - Provide a means of disassociating a document from the parser
   - Add a function "Node::release()", similar to the idea of
     "Range::detach", which allows users to indicate the release of the
     Node.
     - From the C++ Binding abstract interface perspective, it's up to
       the implementation how to handle this "release()" function.
     - With the Xerces-C++ IDOM implementation, the release() function
       will delete the 'this' pointer if it is a document, else it is a
       no-op.
2. Memory retained until the document is deleted.
   - If you change the value of an attribute or call removeNode many
     times, the memory of the old value is not deallocated for reuse and
     the document grows and grows
   Solution:
   - This in fact is a tradeoff for the fast performance offered by the
     independent storage allocator.
   - There is no immediate good solution in place
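To show why the memory is retained, here is a minimal sketch of a per-document bump allocator of the kind described in 4.2.1. It is not the actual Xerces-C++ allocator: DocumentArena, setValue, the 4096-byte block size and the use of char strings instead of XMLCh are all assumptions made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical per-document arena: allocation is a pointer bump, no locking,
// and nothing is given back until the arena (the document) is destroyed.
class DocumentArena {
public:
    void* allocate(std::size_t bytes) {
        assert(bytes > 0 && bytes <= kBlockSize);
        bytes = (bytes + kAlign - 1) / kAlign * kAlign;  // keep allocations aligned
        if (blocks_.empty() || used_ + bytes > kBlockSize) {
            blocks_.push_back(
                std::unique_ptr<unsigned char[]>(new unsigned char[kBlockSize]));
            used_ = 0;
        }
        void* p = blocks_.back().get() + used_;
        used_ += bytes;
        total_ += bytes;
        return p;
    }
    std::size_t bytesRetained() const { return total_; }

private:
    static constexpr std::size_t kBlockSize = 4096;
    static constexpr std::size_t kAlign = alignof(std::max_align_t);
    std::vector<std::unique_ptr<unsigned char[]>> blocks_;
    std::size_t used_ = 0;
    std::size_t total_ = 0;
};

// Models changing an attribute value: the new value is copied into the
// arena and the old one is simply abandoned, never reused.
const char* setValue(DocumentArena& arena, const char* value) {
    std::size_t n = std::strlen(value) + 1;
    char* copy = static_cast<char*>(arena.allocate(n));
    std::memcpy(copy, value, n);
    return copy;
}
```

Calling setValue repeatedly on the same attribute makes bytesRetained() grow on every call, which is the "document grows and grows" behavior in point 2. The speed comes from the same place: allocate() does no bookkeeping that would let it reuse an old slot.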
5.0 old DOM
==========
5.1 Interface
==========
5.1.1 Features of old DOM Interface
-----------------------------------------------------
e.g. DOM_Element DOM_Document::createElement(const DOMString tagName);
1. Use smart pointers
   - Java-like
5.1.2 Pros and Cons of old DOM Interface
--------------------------------------------------------------
Pros:
1. DOM_XXX - reasonable name
2. (Depends on users' preference)
   - some want a Java-like interface for ease of use, ease of learning
     and ease of porting (from Java).
3. Not that many users have migrated to IDOM yet, so migration impact is
   minimal.
Cons:
1. Not an abstract base class
   - Cannot be recommended as the Apache DOM C++ Binding
   - Implementation (smart pointer indirection) is assumed
   Solution:
   - This in fact is a tradeoff for the ease of use of the smart pointer
     design
   - No solution.
2. (Depends on users' preference)
   - some want a C++-like style, as this is a C++ interface
5.2 Implementation
===============
5.2.1 Features of old DOM Implementation
----------------------------------------------------------------
1. Automatic memory management
   - Memory is released when there are no more handles pointing to it
   - Uses a reference count to keep track of handles
2. Uses a thread-safe DOMString class
5.2.2 Downside of old DOM Implementation
--------------------------------------------------------------------
1. Performance is slow
   - Memory management is the biggest time consumer, and there is a
     large memory footprint.
   - There are a whole lot of blocks allocated when creating a document
     and then freed when finished with it. Each and every node requires
     at least one and sometimes several separately allocated blocks.
     DOMStrings take three. It adds up.
   Solution:
   - Lenny suggests using the IDOM interface internally in the DOM
     implementation; patch in Bugzilla 5967
     - Then the performance benefits of IDOM are gained, but the memory
       retention problem in the IDOM implementation still remains to be
       addressed.
     - And internally, we will have a dual interface maintenance model,
       as the IDOM interface is then used by DOM internally.
Vote Question:
============
I would like to call for a vote:
==> Which INTERFACE should be the Xerces-C++ public supported W3C
DOM Interface, DOM or IDOM? <===
Note:
1. The question is asking which "interface" is to be officially
   supported. Once the interface is chosen, we can discuss how to solve
   the downsides of its implementation as the next topic.
2. The one voted for will become the ONLY Xerces-C++ supported public
   W3C DOM Interface, and is where DOM Level 3 will be implemented.
3. The API of the other interface will be deprecated, and its samples
   and associated parser will eventually be removed from the
   distribution.