RE: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

Samar Lotia Mon, 29 Apr 2002 20:28:41 -0700

I don't know enough about the implementations of DOMString or the IDOM tree model, so I will phrase my comments partly as questions.

Is the internal implementation also going to use this lightweight string class to maintain its strings? If not, how would we guarantee that nobody is going to change the memory that this DOMString is pointing to? If we cannot guarantee this, then the semantics of this 'DOMString' may be confusing to users.

Also, what about 'read-only' thread safety? I know this is not necessarily a design goal here, but it sure would be nice to have. Has any thought been given to this? Or is the current thinking that each thread must be looking at its own copy of the DOM?

Again, from what I have understood of Lenny's handle/body implementation we do not have 'read-only' thread safety. Consider the case where two threads are pointing to the same element from a given document tree. As both of them destruct their handle object, both handles will attempt to decrement the reference count and we could end up with a race condition here. This MAY result in unpredictable behavior, unless we implement the reference counting in a thread safe manner. This is especially true on SMP machines. I may have this all wrong and this may be thread-safe, in which case I apologize.
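A minimal sketch of what a thread-safe reference count could look like, assuming modern C++ (`std::atomic`, which was not available in 2002). `NodeBody` and `NodeHandle` are hypothetical names for illustration, not the Xerces-C++ classes.

```cpp
#include <atomic>

// Hypothetical body: carries the shared reference count.
struct NodeBody {
    std::atomic<int> refs{0};
};

// Hypothetical handle: copying increments the count, destruction decrements it.
class NodeHandle {
public:
    explicit NodeHandle(NodeBody* body) : body_(body) { retain(); }
    NodeHandle(const NodeHandle& other) : body_(other.body_) { retain(); }
    NodeHandle& operator=(const NodeHandle& other) {
        // Retain the new body first so self-assignment never drops to zero.
        other.body_->refs.fetch_add(1, std::memory_order_relaxed);
        release();
        body_ = other.body_;
        return *this;
    }
    ~NodeHandle() { release(); }
    int useCount() const { return body_->refs.load(); }
private:
    // Atomic read-modify-write: two threads cannot lose an update.
    void retain() { body_->refs.fetch_add(1, std::memory_order_relaxed); }
    void release() { body_->refs.fetch_sub(1, std::memory_order_acq_rel); }
    NodeBody* body_;
};

// Copies a handle in a nested scope and reports the count afterwards.
int countAfterScopedCopy() {
    NodeBody body;
    NodeHandle root(&body);       // stands in for the document's own reference
    {
        NodeHandle reader(root);  // a second reader: count goes to 2
    }                             // reader destroyed: count back to 1
    return root.useCount();
}
```

With a plain `int` counter, the increment and decrement would be separate load, modify, and store steps, which is where the race described above could appear.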

Simply by implementing thread safe increment/decrement, we can guarantee that as long as NO CHANGES are made to the document itself, multiple threads can be reading various parts of it. This is because on each object there will be at least one handle holding a reference count (i.e. the main document itself), hence no objects will need to be added to the allocator's free list. Note that if we end up deleting something, and this has to be added to the document's free list then the allocator needs to be thread-safe. Note also that making the per document allocator thread safe may not be too bad as there will rarely be contention for this allocator. We can use a non-yielding spin latch (STLport does one) which would mean very little overhead for having a truly thread-safe (read-only) DOM model. Note that in many cases the spin latch is highly optimized by relying on OS specific interfaces for atomic operations (Win32 InterlockedXXXX functions), or in some cases hand optimizing assembly to implement atomic operations (STLport does this on Solaris).
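As an illustration of the spin latch idea, here is a small sketch using `std::atomic_flag` from modern C++. It is not the STLport implementation; `SpinLatch` and `DocumentFreeList` are hypothetical names, and the free list stands in for whatever the per-document allocator would actually keep.

```cpp
#include <atomic>
#include <mutex>
#include <vector>

// Non-yielding spin latch: busy-waits instead of blocking in the OS.
class SpinLatch {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin; contention on a per-document allocator is expected to be rare
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }
private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical per-document free list guarded by the latch.
class DocumentFreeList {
public:
    void recycle(void* block) {
        std::lock_guard<SpinLatch> guard(latch_);  // unlocks even if push_back throws
        blocks_.push_back(block);
    }
    // Returns a recycled block, or nullptr when the list is empty.
    void* reuse() {
        std::lock_guard<SpinLatch> guard(latch_);
        if (blocks_.empty()) return nullptr;
        void* block = blocks_.back();
        blocks_.pop_back();
        return block;
    }
private:
    SpinLatch latch_;
    std::vector<void*> blocks_;
};
```

Readers that never delete anything would never touch the latch at all, which matches the read-only case described above.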

This is one thing that the existing IDOM has going for it, i.e. it is 'read-only' thread safe.

More of my two bits...

Samar Lotia

-----Original Message-----
From: Lenny Hoffman [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 21:47
To: [EMAIL PROTECTED]
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

I just had a new thought: if having a DOMString class is desired, for functionality and/or DOM compliance, then the smart pointer approach can still be used by updating the IDOM classes to return DOMString instances instead of XMLCh*. Using smart pointers we would still have only one set of interfaces to maintain, and the performance impact would be negligible because, as I pointed out earlier, I modified DOMString to simply wrap an alias to the node-owned XMLCh* data and to make a copy only if it is modified.
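A rough sketch of the "alias until modified" behavior, not the actual patched DOMString. It assumes `char16_t` as a stand-in for XMLCh, and `StringAlias`, `rawBuffer`, `ownsCopy`, and `append` are hypothetical names.

```cpp
#include <string>

using XMLChLike = char16_t;  // stand-in for XMLCh; an assumption of this sketch

// Hypothetical wrapper: aliases node-owned text until the first modification.
class StringAlias {
public:
    // nodeOwned must be a non-null, null-terminated buffer.
    explicit StringAlias(const XMLChLike* nodeOwned) : alias_(nodeOwned) {}

    // Read access never copies.
    const XMLChLike* rawBuffer() const { return owned_ ? copy_.c_str() : alias_; }
    bool ownsCopy() const { return owned_; }

    // The first write copies the aliased text into private storage.
    void append(const XMLChLike* suffix) {
        if (!owned_) {
            copy_ = alias_;
            owned_ = true;
        }
        copy_ += suffix;
    }
private:
    const XMLChLike* alias_;
    std::u16string copy_;
    bool owned_ = false;
};
```

Until the first write, the wrapper is only valid while the node-owned buffer stays alive and unchanged.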

Lenny

-----Original Message-----
From: Lenny Hoffman [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 9:37 PM
To: [EMAIL PROTECTED]
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

Hi Samar,

You make good points.

I would agree that it is reasonable to nix the DOMString, but does anyone object to that given that DOMString is explicitly specified in the W3C DOM specification? Judging so far from the early responders to the vote, no, as folks voting for the IDOM interface are also voting to nix the DOMString class.

Tinny, do you expect the W3C to complain if the C++ binding does not have a DOMString? In other words, will we be able to call ourselves DOMx compliant without it?

One more consequence of using the smart pointer approach is that backwards compatibility with the original DOM interfaces is sacrificed for backwards compatibility with the IDOM interfaces. I thought that, with the original DOM interfaces being officially supported and around longer, backwards compatibility with them would be more important, but so far no one using the original DOM interface has spoken up. For my use cases it simply doesn't matter; what matters most to me is functional behavior and ease of use.

Just to make it easier to review, here is the earlier example following your suggestion to avoid using an int operator on node for null comparison:

if (!pm_Element.isNull())

    pm_Element->getAttribute(...);

Lenny

-----Original Message-----
From: Samar Lotia [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 7:59 PM
To: '[EMAIL PROTECTED]'
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

If the desire is to maintain only one interface, then I would be of the opinion that we should nix the DOMString class and use a 'smart pointer' class to wrap the internal interfaces. In many cases, people will likely have their own preferred string class and will immediately convert the value extracted from the DOM before passing it into any other layer of their code.

If we keep DOMString around, I would recommend against having a (const XMLCh *) operator as this can result in some incredibly hard to track errors. Most C++ style guides recommend against implicit conversion operators. Note the lack of such an operator in the C++ standard library string, i.e. std::basic_string<T>. Having something like rawBuffer, or XMLCh() would be clearer and lets one control lifetimes in a much clearer way (IMHO).
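To illustrate the lifetime concern, here is a small sketch with a named accessor in place of an implicit conversion. `NamedAccessString` and `makeAttributeValue` are hypothetical, and `char16_t` stands in for XMLCh.

```cpp
#include <string>
#include <utility>

// Hypothetical string class with a named accessor instead of an
// implicit (const XMLCh*) conversion operator.
class NamedAccessString {
public:
    explicit NamedAccessString(std::u16string text) : text_(std::move(text)) {}
    // The caller has to ask for the pointer by name, which makes the
    // dependency on this object's lifetime visible at the call site.
    const char16_t* rawBuffer() const { return text_.c_str(); }
private:
    std::u16string text_;
};

// Hypothetical getter that returns a temporary by value.
NamedAccessString makeAttributeValue() { return NamedAccessString(u"value"); }

std::u16string copyAttributeValue() {
    // With an implicit conversion operator, the following would compile
    // and leave p dangling once the temporary is destroyed:
    //   const char16_t* p = makeAttributeValue();
    // Here the temporary lives until the end of the full expression, so
    // its contents are copied before it goes away.
    return std::u16string(makeAttributeValue().rawBuffer());
}
```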

Also, I would recommend against adding an int operator on the smart pointer class. It is not that much work to call isNull on the object, and is much clearer from a readability perspective as well as helps catch silly errors at compile time. If we must have such an operator then it may be better to add a bool operator instead of int, as this will likely reduce the number of places where the implicit conversion operator will be called.
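A short sketch of the difference, assuming a modern compiler: since C++11 the bool conversion can be marked `explicit`, which allows `if (p)` while still rejecting accidental conversions to int. `Element` and `ElementPtr` are hypothetical stand-ins.

```cpp
#include <type_traits>

// Stand-in for an IDOM node class.
struct Element { int id; };

// Hypothetical smart-pointer handle.
class ElementPtr {
public:
    ElementPtr() : ptr_(nullptr) {}
    explicit ElementPtr(Element* p) : ptr_(p) {}
    bool isNull() const { return ptr_ == nullptr; }
    // explicit: usable in `if (p)` and `!p`, but not in arithmetic or
    // other implicit conversions, unlike `operator int`.
    explicit operator bool() const { return ptr_ != nullptr; }
private:
    Element* ptr_;
};

// Checked at compile time: the handle does not silently become an int.
static_assert(!std::is_convertible<ElementPtr, int>::value,
              "ElementPtr must not convert implicitly to int");
```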

My two bits...

Samar Lotia

-----Original Message-----
From: Lenny Hoffman [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 19:38
To: [EMAIL PROTECTED]
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

Hi Markus,

Thank you very much for the insight.

Note that simply accessing the IDOM implementation via handles does not affect its thread safety-ness, thus your application is safe.

> if (pm_Element)
>
>     pm_Element->getAttribute(...);
>
> How can I do this with references?

You do it with the current handles like this:

if (!pm_Element.isNull())

    pm_Element.getAttribute(...);

Adding an int operator to DOM_Node would allow even more friendly syntax; e.g.

if (pm_Element)

    pm_Element.getAttribute(...);

This could be easily added.

In fact, an -> operator could be added to the DOM_Node classes to get this:

if (pm_Element)

    pm_Element->getAttribute(...);

This is now exactly what you started out with, thus is completely backward compatible with your current use of the IDOM.
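A minimal sketch of a handle that forwards through `operator->`, so the body's interface does not have to be replicated on the handle. `BodyElement` and `ElementHandle` are hypothetical stand-ins, and `getAttribute` is simplified to take no arguments.

```cpp
#include <string>
#include <utility>

// Stand-in for an IDOM-style body class (not the real IDOM_Element).
class BodyElement {
public:
    explicit BodyElement(std::string value) : value_(std::move(value)) {}
    const std::string& getAttribute() const { return value_; }
private:
    std::string value_;
};

// Hypothetical handle: calls reach the body through operator->.
class ElementHandle {
public:
    ElementHandle() : body_(nullptr) {}
    explicit ElementHandle(BodyElement* body) : body_(body) {}
    bool isNull() const { return body_ == nullptr; }
    BodyElement* operator->() const { return body_; }
private:
    BodyElement* body_;
};

// Same shape as the pointer-based code: check, then call through ->.
std::string readAttribute(const ElementHandle& pm_Element) {
    if (!pm_Element.isNull())
        return pm_Element->getAttribute();
    return std::string();
}
```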

> XMLCh* are easier to handle than DOMString objects in ATL: CComBSTR cBstr = pm_Element->getAttribute(...);

Good point, the current DOMString class does not have an XMLCh* operator, which if it did would solve your problem. I pretty much gutted the original DOMString class to make it a simple wrapper around an XMLCh* returned from IDOM implementations, rather than suffer the costs of the cross-document string management of the original DOM. As far as I can tell, the only reason the original DOMString did not have an XMLCh* operator was that there was no guarantee that its internal XMLCh* was null terminated; that guarantee now exists and the operator can be added. I will do that. So your example remains:

CComBSTR cBstr = pm_Element->getAttribute(...);

Note that string classes are a convenient way to perform various operations on a string without using the static (read: functional) methods provided by XMLString. I even implemented COW (copy on write) behavior in the new DOMString class, so that you can feel free to modify a string returned from a node without having to manually make a copy.

If folks don't find the DOMString wrapper to be that important, that frees me up to simplify the handle classes and address one of Tinny's concerns. Tinny pointed out that while the new design hides dual interfaces (DOM and IDOM) from users, it does not hide them from DOM developers; as DOM 3 support is added, each interface change would have to be made to both DOM and IDOM classes. The only reason I went with complete interface replication instead of simple smart pointers for the handle classes was to be able to translate XMLCh pointers returned from IDOM nodes into DOMStrings. If I am allowed to get rid of DOMString altogether I can make the handle classes simple smart pointers that do not replicate IDOM interfaces, and thus the duplication of effort is eliminated.

Lenny

-----Original Message-----
From: Markus Fellner [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 6:17 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: AW: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

OK, the main reason for my IDOM flirtation is...

I've chosen IDOM because of its thread safety. And now I have several thousand lines of code using the IDOM interface.

Some other reasons are...

I have many IDOM_Element* members (pm_Elem) in my classes. After parsing they are assigned once, and then checked many times to see whether they are really assigned before being used for reading and writing attributes.

if (pm_Element)

    pm_Element->getAttribute(...);

How can I do this with references?

XMLCh* are easier to handle than DOMString objects in ATL: CComBSTR cBstr = pm_Element->getAttribute(...);

...

Sorry for my short answer. I go on holiday tomorrow and I have to pack!

I'm back in 2 weeks and looking forward to seeing the results of this vote.

It's a pity to leave during a hot discussion on this list.

Markus

-----Original Message-----
From: Lenny Hoffman [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 23:54
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

Hi Markus,

To be clear, the fix I created for the IDOM was to recycle memory once a node or string is no longer needed. To know when a node is no longer needed I used the original DOM interface classes, but have them wrap the IDOM as the implementation. IDOM performance is maintained, but ease of use is greatly improved. Without the DOM handles to track whether an IDOM node is in use, application code would have to state explicitly when a node is no longer needed and can be recycled, which is yet another thing to document and for application developers to get wrong and suffer the consequences of.

If you love and use the IDOM for its performance, and you want the memory problem really fixed rather than worked around in a way that only holds if your application does everything right, then you will love what I have done by combining DOM classes as handles with IDOM classes as bodies.
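To make the handle/body recycling idea concrete, here is a single-threaded sketch. It is not the code from the patch: `RecyclableBody`, `BodyPool`, and `RecyclingHandle` are hypothetical names, and the pool is a simplified stand-in for the per-document allocator.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

struct RecyclableBody {
    int refs = 0;  // number of handles currently pointing at this body
};

// Hypothetical per-document pool: recycled bodies are reused before new ones are made.
class BodyPool {
public:
    RecyclableBody* acquire() {
        if (!free_.empty()) {
            RecyclableBody* b = free_.back();
            free_.pop_back();
            return b;
        }
        storage_.emplace_back();
        return &storage_.back();
    }
    void recycle(RecyclableBody* b) { free_.push_back(b); }
    std::size_t allocated() const { return storage_.size(); }
private:
    std::deque<RecyclableBody> storage_;  // deque keeps element addresses stable
    std::vector<RecyclableBody*> free_;
};

// Hypothetical handle: the last handle to go away returns the body to the pool.
class RecyclingHandle {
public:
    explicit RecyclingHandle(BodyPool& pool) : pool_(&pool), body_(pool.acquire()) {
        ++body_->refs;
    }
    RecyclingHandle(const RecyclingHandle& o) : pool_(o.pool_), body_(o.body_) {
        ++body_->refs;
    }
    RecyclingHandle& operator=(const RecyclingHandle&) = delete;  // kept minimal
    ~RecyclingHandle() {
        if (--body_->refs == 0) pool_->recycle(body_);
    }
private:
    BodyPool* pool_;
    RecyclableBody* body_;
};

// Creates and drops handles repeatedly; returns how many bodies were ever allocated.
std::size_t bodiesAllocatedAfterChurn(int rounds) {
    BodyPool pool;
    for (int i = 0; i < rounds; ++i) {
        RecyclingHandle h(pool);
        RecyclingHandle copy(h);
    }
    return pool.allocated();
}
```

The application never says when a body is unused; the handle destructors do that bookkeeping, so repeated churn reuses the same storage instead of growing the document.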

If what you love is working with pointers instead of with objects, please let me know why.

One thing I have found harder with objects vs. pointers is downcasting from node to derived objects like element. The syntax is a bit cleaner with pointers; e.g.:

    DOM_Node node = ...

    DOM_Element elem = (const DOM_Element&)node;

vs:

    IDOM_Node* node = ..

    IDOM_Element* elem = (IDOM_Element*)node;

It is easy to forget to add the const in the first case, and it is somewhat non-intuitive because slicing can happen, though it is not a problem in this case.

To solve this problem I have thought of adding overloaded constructors and assignment operators that take a DOM_Node to DOM_Node derived classes like DOM_Element. Thus the first example becomes:

    DOM_Node node = ...

    DOM_Element elem = node;

Not only is this code more succinct, but it is safer, as the overloaded constructor and assignment operator can check for node compatibility via the getNodeType call.
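A sketch of such a checked conversion, with hypothetical `SketchNode` and `SketchElement` classes standing in for DOM_Node and DOM_Element. What happens on a type mismatch is an assumption here: this version yields a null handle, though throwing would be another option.

```cpp
enum class SketchNodeType { Element, Text };

// Stand-in body that only records its node type.
struct SketchBody { SketchNodeType type; };

class SketchNode {
public:
    SketchNode() : body_(nullptr) {}
    explicit SketchNode(const SketchBody* body) : body_(body) {}
    bool isNull() const { return body_ == nullptr; }
    SketchNodeType getNodeType() const { return body_->type; }
private:
    const SketchBody* body_;
};

class SketchElement : public SketchNode {
public:
    SketchElement() {}
    // Converting constructor: accepts the node only if it really is an
    // element; otherwise the new handle stays null (an assumption).
    SketchElement(const SketchNode& node) {
        if (!node.isNull() && node.getNodeType() == SketchNodeType::Element)
            SketchNode::operator=(node);
    }
};
```

Because the constructor is not `explicit`, `SketchElement elem = node;` compiles as written, and assigning a node to an existing element handle goes through the same check.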

Again, please let me know what other aspects of pointers make things easier for you.

> Hope your fix has no effects on thread-safe-ness!

No effect whatsoever.

Lenny

-----Original Message-----
From: Markus Fellner [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 4:15 PM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: AW: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

Hi Lenny,

I hope your fix of the IDOM memory problem goes into the next official release. But I use and love the IDOM interface.

It's really easier for an old C++ programmer like me! And I use IDOM cause of its threadsafe properties. Hope your fix has no effects on thread-safe-ness!

Markus

-----Original Message-----
From: Lenny Hoffman [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 17:57
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

Hi Markus,

The memory management problem is solved by recycling nodes and strings that are no longer used. The only clean way I know to tell when nodes and strings are in use is the handle/body pattern, which is what the original DOM uses. What I have done is use the original DOM handles with the IDOM implementation, and fix the IDOM memory problem.

Lenny

-----Original Message-----
From: Markus Fellner [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 10:54 AM
To: [EMAIL PROTECTED]
Subject: AW: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

If the memory management problem is solved, I prefer IDOM!!!

-----Original Message-----
From: Tinny Ng [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 29, 2002 17:08
To: [EMAIL PROTECTED]
Subject: Call for Vote: which one to be the Xerces-C++ public supported W3C DOM interface

Hi everyone,

I've reviewed Andy's design objective of IDOM, Lenny's view of old DOM and his proposal of redesign, and some users' feedback. Here is a "quick" summary and I would like to call for a VOTE about the fate of these two interfaces.

1.0 Objective

==========

1. Define the strategy of Xerces-C++ public DOM interface.  Decide which one to keep, old DOM interface or new IDOM interface

2.0 Motivation

===========

1. As a long term strategy, Xerces-C++ shouldn't define two W3C DOM interfaces, which simply confuses users.

    => We've already got many users' questions about what the difference is, which one to use ... etc.

2. With limited resources, we should focus our development on ONE stream, with no more duplicate effort

    => New DOM Level 3 development should be done on one interface, not both.

    => No more dual maintenance: two sets of samples (e.g. DOMPrint vs IDOMPrint), two parsers (DOMParser vs IDOMParser)

3. To better place Apache Xerces-C++ in the market, we should have our Apache Recommended DOM C++ Binding in http://www.w3.org/DOM/Bindings

    => To encourage more users to develop DOM application AND implementation based on this binding.

    => Such binding should just define a set of abstract base classes (similar to JAVA interface) where no implementation model is assumed

3.0 History

=========

'DOM' was the initial "W3C DOM interface" developed by Xerces-C++. However the performance of its implementation is not quite satisfactory.

Last year, Andy Heninger came up with a new design with faster performance, and that implementation came with a new set of interfaces => 'IDOM'.

Currently both 'DOM' and 'IDOM' are shipped with Xerces-C++. 'IDOM' is labeled experimental (like a prototype) and is subject to change.

More information can be found at:
http://xml.apache.org/xerces-c/program.html

http://www.apache.org/~andyh/

http://marc.theaimsgroup.com/?t=101650188300002&r=1&w=2

http://marc.theaimsgroup.com/?w=2&r=1&s=Proposal%3A+C%2B%2B+Language+Binding+for+DOM+L&q=t

4.0 IDOM

=========

4.1 Interface

==========

4.1.1 Features of IDOM Interface

--------------------------------------------------

e.g. virtual IDOM_Element* IDOM_Document::createElement(const XMLCh* tagName) = 0;

1. Define as abstract base classes

2. Use normal C++ pointers.

    => So that abstract base class is possible.

    => Make it more C++ like. Less Java like.
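A sketch of what an abstract-base-class binding with plain pointers looks like in practice. The class names are stand-ins, not the real Xerces-C++ declarations, `char16_t` stands in for XMLCh, and element ownership by the document is an assumption of this sketch.

```cpp
#include <memory>
#include <string>
#include <vector>

// Abstract interfaces: pure virtual functions only, no implementation assumed.
class AbstractElement {
public:
    virtual ~AbstractElement() = default;
    virtual const char16_t* getTagName() const = 0;
};

class AbstractDocument {
public:
    virtual ~AbstractDocument() = default;
    // Returns a plain pointer; in this sketch the document owns the element.
    virtual AbstractElement* createElement(const char16_t* tagName) = 0;
};

// One possible implementation, hidden behind the interfaces.
class SimpleElement : public AbstractElement {
public:
    explicit SimpleElement(const char16_t* tagName) : tagName_(tagName) {}
    const char16_t* getTagName() const override { return tagName_.c_str(); }
private:
    std::u16string tagName_;
};

class SimpleDocument : public AbstractDocument {
public:
    AbstractElement* createElement(const char16_t* tagName) override {
        elements_.emplace_back(new SimpleElement(tagName));
        return elements_.back().get();
    }
private:
    std::vector<std::unique_ptr<SimpleElement>> elements_;
};

// Callers see only the abstract types.
std::unique_ptr<AbstractDocument> makeDocument() {
    return std::unique_ptr<AbstractDocument>(new SimpleDocument);
}
```

A different vendor could supply its own classes behind the same two interfaces without any change to calling code.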

4.1.2 Pros and Cons of IDOM Interface

----------------------------------------------------------

Pros:

1. Abstract base classes that correspond to the W3C DOM interfaces

    => Can be recommended as Apache DOM C++ Binding

    => More standard like, no implementation assumed as they are just abstract interfaces using pure virtual functions

2. (Depends on users' preference)

    - some users prefer a C++-like style

Cons:

1. IDOM_XXX - weird prefix 'I'

    Solution:

        - Proposed to rename to DOMXXXX which also matches the DOM Level 3 naming convention

2. (Depends on users' preference)

    - some users do not like pointers and want a Java-like interface for ease of use, ease of learning and ease of porting (from Java).

3. As the old DOM interface has been around for a long time, the majority of current Xerces-C++ users still use it, so the migration impact is significant

    Solution:

        - Announce the deprecation of old DOM interface for a couple of releases before removal



4.2 Implementation

===============

4.2.1 Features of IDOM Implementation

-----------------------------------------------------------

1. Use an independent storage allocator per document. The advantage here is that allocation would require no synchronization

    => Fast, good scalability, reduced memory footprint

2. Use plain, null-terminated (XMLCh *) utf-16 strings.

    => No DOMString class overhead which is another performance contributor that makes IDOM faster
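A simplified sketch of a per-document storage allocator, to show why allocation needs no synchronization and why memory comes back only when the document goes away. `DocumentArena` is a hypothetical name and the block size is arbitrary; this is not the IDOM allocator itself.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical per-document bump allocator: no locks and no per-object free.
class DocumentArena {
public:
    explicit DocumentArena(std::size_t blockSize = 4096) : blockSize_(blockSize) {}

    void* allocate(std::size_t bytes) {
        // Round up so every returned pointer is suitably aligned.
        const std::size_t align = alignof(std::max_align_t);
        bytes = (bytes + align - 1) / align * align;
        if (blocks_.empty() || used_ + bytes > currentSize_) {
            // Start a new block; oversized requests get a block of their own.
            currentSize_ = bytes > blockSize_ ? bytes : blockSize_;
            blocks_.emplace_back(new unsigned char[currentSize_]);
            used_ = 0;
        }
        void* p = blocks_.back().get() + used_;
        used_ += bytes;  // "allocation" is just a pointer bump
        return p;
    }
    std::size_t blockCount() const { return blocks_.size(); }
    // Nothing is freed individually: all blocks are released together
    // when the arena (that is, the document) is destroyed.
private:
    std::size_t blockSize_;
    std::size_t currentSize_ = 0;
    std::size_t used_ = 0;
    std::vector<std::unique_ptr<unsigned char[]>> blocks_;
};
```

Since each document has its own arena, two threads building two different documents never share allocator state.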

4.2.2 Downside of IDOM Implementation

-------------------------------------------------------------

1. Manual memory management

    - If the document comes from the parser, then the parser owns the document. If the document comes from DOMImplementation, then users are responsible for deleting it.

    Solution:

        - Provide a means of disassociating a document from the parser

        - Add a function "Node::release()", similar to the idea of "Range::detach", which allows users to indicate the release of the Node.

            - From C++ Binding abstract interface perspective, it's up to implementation how to handle this "release()" function.

            - With Xerces-C++ IDOM implementation, the release() function will delete the 'this' pointer if it is a document, else no-op.

2. Memory retained until the document is deleted.

    - If you change the value of an attribute or call removeNode many times, the memory of the old value is not deallocated for reuse and the document grows and grows

    Solution:

        - This in fact is a tradeoff for the fast performance offered by independent storage allocator.

        - There is no immediate good solution in place
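The proposed release() semantics from item 1 above (delete 'this' for a document, no-op for any other node) can be sketched as follows. All class names are hypothetical, and the live counter exists only so the effect can be observed.

```cpp
// Hypothetical node interface with the proposed release() function.
class ReleasableNode {
public:
    virtual void release() = 0;
protected:
    virtual ~ReleasableNode() = default;  // callers use release(), not delete
};

// Non-document node: its storage stays with the owning document.
class OwnedElement : public ReleasableNode {
public:
    void release() override {}  // no-op
};

// Document created by the user, so the user must release it.
class HeapDocument : public ReleasableNode {
public:
    explicit HeapDocument(int* liveCounter) : live_(liveCounter) { ++*live_; }
    OwnedElement* documentElement() { return &root_; }
    void release() override { delete this; }  // frees the document and its nodes
private:
    ~HeapDocument() override { --*live_; }    // private: heap-only, via release()
    int* live_;
    OwnedElement root_;
};

// Releases an element (no effect) and then the document; returns live documents.
int liveDocumentsAfterRelease() {
    int live = 0;
    HeapDocument* doc = new HeapDocument(&live);
    ReleasableNode* elem = doc->documentElement();
    elem->release();            // no-op: the document is still alive
    ReleasableNode* asNode = doc;
    asNode->release();          // deletes the document
    return live;
}
```

Under this scheme the memory retained by removed nodes (item 2) is still not reclaimed until the document itself is released.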

5.0 old DOM

==========

5.1 Interface

==========

5.1.1 Features of old DOM Interface

-----------------------------------------------------

e.g. DOM_Element DOM_Document::createElement(const DOMString tagName);

1. Use smart pointers - Java-like

5.1.2 Pros and Cons of old DOM Interface

--------------------------------------------------------------

Pros:

1. DOM_XXX - reasonable name

2. (Depends on users' preference)

    - some users want a Java-like interface for ease of use, ease of learning and ease of porting (from Java).

3. Not that many users have migrated to IDOM yet, so migration impact is minimal.

Cons:

1. Not abstract base class

    - Cannot be recommended as Apache DOM C++ Binding

    - Implementation (smart pointer indirection) is assumed

    Solution:

        - This in fact is a tradeoff for the ease of use of smart pointer design

        - No solution.

2. (Depends on users' preference)

    - some users want a C++-like style, as this is a C++ interface



5.2 Implementation

===============

5.2.1 Features of old DOM Implementation

----------------------------------------------------------------

1. Automatic memory management

    - Memory is released when there are no more handles pointing to it

    - Use reference count to keep track of handles

2. Use thread-safe DOMString class

5.2.2 Downside of old DOM Implementation

--------------------------------------------------------------------

1. Performance is slow

    - Memory management is the biggest time consumer, and the memory footprint is large.

    - There are a whole lot of blocks allocated when creating a document and then freed when finished with it. Each and every node requires at least one and sometimes several separately allocated blocks. A DOMString takes three. It adds up.

    Solution:

        - Lenny suggests using the IDOM interface internally in the DOM implementation; see the patch in Bugzilla 5967

        - The performance benefits of IDOM are then gained, but the memory retention problem in the IDOM implementation still remains to be addressed.

        - And internally, we will have a dual interface maintenance model, as the IDOM interface is then used by DOM internally.

Vote Question:

============

I would like to call for a vote:

    ==> Which INTERFACE should be the Xerces-C++ public supported W3C DOM Interface, DOM or IDOM? <===

Note:

1. The question is asking which "interface" is to be officially supported. Once the interface is chosen, we can discuss how to solve the downsides of its implementation as the next topic.

2. The one voted for will become the ONLY Xerces-C++ supported public W3C DOM Interface, and is where DOM Level 3 will be implemented.

3. The API of the other interface will be deprecated, and its samples and associated parser will eventually be removed from the distribution.
