Lenny, Yes I am reviewing that together as well.
I got stuck on the memory management question and was then sidetracked by many other things over the last month, so I have not continued the investigation since my last post. Since we cannot start our DOM L3 development until this issue is resolved, I must dedicate myself to looking into it first over the next couple of weeks. I will post to the mailing list once I have a better idea.

Thanks!

Tinny

----- Original Message -----
From: "Lenny Hoffman" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, April 15, 2002 4:01 PM
Subject: RE: DOM 3 Patches?

Hi Tinny,

How is your review of the DOM-IDOM integration going? Remember that it has an impact on the decision to standardize on IDOM as the Xerces implementation.

Also, when you last posted that you were considering standardizing on IDOM, there was quite a bit of discussion regarding the danger of going to a fixed memory model, which I don't remember you commenting on. There are also open issues regarding serious memory leaks in the current IDOM that have not been addressed; specifically, IDOM does not release any allocated memory until the owning document is deleted, which leads to unlimited growth when performing common operations like changing attribute values and adding and removing elements.

I have been working on a write-up that describes my view of the Xerces DOM. It is not complete yet, but since I haven't heard from you on your position, I have included it below so that you and anyone else interested can comment.

------------------------------------------------------------

Xerces DOM Redesign

Background

The W3C has a recommendation for a standard DOM, but it did not provide a recommendation for how the C++ language should bind to the DOM as it did for Java. Thus, C++ bindings are free to provide any type of interface they see fit.
The first approach taken by the Xerces project was to emulate the Java binding, which offered several benefits:

* Those familiar with the Java DOM binding would find it easy to learn and use the C++ DOM.
* Memory management is hidden from C++ DOM users, just as it is for Java DOM users.

The solution chosen for the memory management problem was to utilize the handle/body pattern and use reference counting to know when a node body is no longer needed. A node body is no longer needed when:

1. No more handles are pointing to it.
2. It has no parent node. In other words, it is no longer part of a document.

The document node is treated specially and is no longer needed when:

1. No more handles are pointing to it.
2. None of its owned nodes have any handles pointing to them.

Nodes that are not part of a document are deleted as soon as no handles are using them any longer, i.e. the client is done with them. Nodes directly or indirectly owned by a document node, and that document node itself, are deleted as soon as no handles point to any of them. The combination of these two policies ensures that no reachable nodes are deleted, and that nodes are deleted as soon as they become unreachable.

Some found the performance of the DOM to be less than they hoped for from a C++ DOM implementation, and devised an alternative approach named IDOM. For the purposes of this discussion, the original approach described above will be referred to as DOM.

It was thought that reference counting was incurring a large performance hit, and the developers of IDOM abandoned reference counting in favor of the following policies:

1. All nodes that are created by the document are owned by that document and are not deleted until the document itself is deleted.
2. If the document was obtained from the IDOM_Parser, then the parser manages the document's lifetime.
3. If the document was obtained via IDOM_DOMImplementation, then the user is required to manage the document's lifetime, i.e. delete it when done with it.
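
To make the IDOM ownership policies above concrete, here is a minimal sketch. It assumes hypothetical stand-in types (`Node`, `Document`, `churn`) and is not the actual Xerces IDOM code; it only illustrates policy 1 (the document owns every node it creates and frees nothing until it is destroyed) and policy 3 (the caller deletes a document it created itself).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the Xerces IDOM classes.
struct Node {
    std::string name;
    Node* parent = nullptr;
};

class Document {
public:
    // Policy 1: every node the document creates is owned by the document.
    Node* createElement(const std::string& name) {
        pool_.push_back(std::make_unique<Node>());
        pool_.back()->name = name;
        return pool_.back().get();
    }

    void appendChild(Node* parent, Node* child) { child->parent = parent; }

    // Removal only detaches the node; its storage stays in the pool.
    void removeChild(Node* child) { child->parent = nullptr; }

    std::size_t allocatedNodes() const { return pool_.size(); }

private:
    // Released all at once when the Document is destroyed, never earlier.
    std::vector<std::unique_ptr<Node>> pool_;
};

// Policy 3: the caller created the document, so the caller deletes it.
// Returns how many nodes were still held just before deletion.
std::size_t churn(int rounds) {
    Document* doc = new Document;
    Node* root = doc->createElement("root");
    for (int i = 0; i < rounds; ++i) {
        Node* item = doc->createElement("item");
        doc->appendChild(root, item);
        doc->removeChild(item);  // detached, but still allocated
    }
    std::size_t held = doc->allocatedNodes();
    delete doc;
    return held;
}
```

In this sketch, `churn(1000)` reports 1001 nodes still held (the root plus every removed item), which is the kind of growth that adding and removing elements causes under policy 1.
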
In addition to the new memory policy, the IDOM_Document was made into its own heap manager for its owned nodes, which means that upon document deletion, many individual node deletions are avoided and instead a few blocks are returned to the system.

More related to feel than to performance, the IDOM got rid of the handle/body pattern and instead returns direct pointers to nodes for clients to work with. A similar thing was done with strings: a direct XMLCh pointer is returned from nodes instead of a DOMString object.

Current situation:

The current situation is that both the DOM and IDOM options are made available to Xerces users, with the IDOM deemed experimental and subject to change. This duality, while useful in the short term as an experiment, is harmful if left around too long, as it is not clear to users which is best to use, nor to developers which is best to extend with features from DOM level 3, and so on.

Going forward:

One approach to solving the duality is to eliminate the DOM interfaces in favor of the IDOM interfaces. While this seems attractive from a performance standpoint, there are many drawbacks:

* Xerces becomes fixed to the IDOM memory model. The IDOM returns direct pointers to elements and strings to users, and with direct pointers there is no way to know how long a pointer is in use. The IDOM's solution to this problem is to adopt a policy of keeping all elements and strings in memory as long as the owning document is alive. Other memory models, such as those that cache unused nodes on disk and/or compress them, become impossible to implement because there is no knowledge of when a node is in use and when it is not.
* Backward compatibility with DOM is lost. The DOM interfaces have been around for a long time as the official Xerces interface, and moving to IDOM as the official interface will force existing DOM users to make many changes to their applications.
* Some similarity with the Java version of Xerces is lost.
  This similarity reduces the learning curve for those that move from the Java Xerces to the C++ Xerces for performance or other reasons.
* Users are drawn into managing the IDOM memory model. If they get a document from the parser, then they need to keep the parser around as long as they use the document. If they get the document from the IDOM_DOMImplementation interface, then they are responsible for deleting it. If they get an IDOM_DocumentType from the IDOM_DOMImplementation interface, then they are again responsible for deleting it. While it is common for C++ users to be drawn into managing memory, ease of use is adversely affected (which is why so many patterns that remove this responsibility exist). The relative sizes of the DOM and IDOM user guides illustrate this: the IDOM user guide has to spend a great deal of time explaining how to manage memory, which the DOM guide simply doesn't.
* There is currently a serious memory leak (bug 7645) which, even when fixed, will mean that users are further drawn into managing the IDOM memory model. The leak occurs because once a node has been added to the document it is never deleted from its storage pool, even when removed. The first part of fixing this problem is to provide an overloaded delete operator that removes nodes from the storage pool, to balance the overloaded new operator used to place nodes in the storage pool. The second part is to further expand the IDOM user guide to inform users that they must manually delete any removed nodes that they are done with.

Another approach to solving the duality is to abandon the experimental IDOM altogether, but this is not attractive, as we don't want to lose its performance benefits.

So we need a new approach, one that:

* Is as backwards compatible with the current DOM as possible.
* Does not dictate a particular memory model or DOM implementation.
* For best performance given general use, uses the IDOM implementation as the default implementation.
* Retains the IDOM performance improvement.
* Does not leak memory.

DOM-IDOM Integration:

I recently submitted enhancement request 5967 (DOM-IDOM Integration), which has attached all the changes needed for a new approach that meets these goals. The approach has been evolving and maturing, and this write-up aims at collecting the various scraps I have written about the changes into one place and filling in gaps, with the hope that doing so will encourage adoption.

The idea behind the DOM-IDOM integration was to merge DOM's use of the handle/body pattern with IDOM's implementation. Because the new design aims at supporting any number of alternative body implementations, the IDOM implementation is not made the sole implementation; rather, it is set up as the default implementation, and other implementations can be substituted without affecting clients of the DOM handles.

Use of the handle/body pattern is crucial to meeting our goals. With the handle/body pattern, the specific implementation used for the body is hidden from users, who only work with handles. Furthermore, a handle pointing to a body represents current use of that body, and different implementations can use this knowledge as they need. For example, while the current IDOM implementation keeps all of a document's nodes in memory (which can be a scalability problem), an alternative implementation can retrieve nodes from disk when needed and return them when no longer needed (solving the scalability problem). With the IDOM implementation used as the default implementation, a well-performing DOM is provided for those that can fit their entire documents in memory.

The existing DOM handle classes were sufficient for use as the new handle classes, so I kept them (this also assured meeting the goal of backwards compatibility for users of the DOM interfaces).
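
A minimal sketch of the handle/body arrangement described above may help. The class names (`NodeBody`, `InMemoryBody`, `TrackedBody`, `NodeHandle`) and the `describe` helper are hypothetical and are not the actual Xerces classes. The sketch only shows that a handle can report its usage to an abstract body, and that one body implementation may ignore that signal while another acts on it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Abstract body interface (hypothetical name). Handles see only this.
class NodeBody {
public:
    virtual ~NodeBody() {}
    virtual std::string nodeName() const = 0;
    virtual void addRef() = 0;     // a handle started using this body
    virtual void removeRef() = 0;  // a handle stopped using this body
};

// One possible body: keeps everything in memory and ignores usage signals.
class InMemoryBody : public NodeBody {
public:
    explicit InMemoryBody(std::string name) : name_(std::move(name)) {}
    std::string nodeName() const override { return name_; }
    void addRef() override {}
    void removeRef() override {}
private:
    std::string name_;
};

// Another possible body: tracks how many handles are using it, which an
// implementation could use to decide when a node may be freed or paged out.
class TrackedBody : public NodeBody {
public:
    explicit TrackedBody(std::string name) : name_(std::move(name)), inUse_(0) {}
    std::string nodeName() const override { return name_; }
    void addRef() override { ++inUse_; }
    void removeRef() override { --inUse_; }
    int inUse() const { return inUse_; }
private:
    std::string name_;
    int inUse_;
};

// The handle is all that client code touches; it never names a concrete body.
// This sketch assumes handles are always constructed with a non-null body.
class NodeHandle {
public:
    explicit NodeHandle(NodeBody* body) : body_(body) { if (body_) body_->addRef(); }
    NodeHandle(const NodeHandle& other) : body_(other.body_) { if (body_) body_->addRef(); }
    NodeHandle& operator=(const NodeHandle& other) {
        if (other.body_) other.body_->addRef();  // add first: safe on self-assignment
        if (body_) body_->removeRef();
        body_ = other.body_;
        return *this;
    }
    ~NodeHandle() { if (body_) body_->removeRef(); }
    std::string nodeName() const { return body_->nodeName(); }
private:
    NodeBody* body_;
};

// Client code written against the handle works with either body.
std::string describe(const NodeHandle& node) { return "<" + node.nodeName() + ">"; }
```

Because `describe` takes only a `NodeHandle`, swapping `InMemoryBody` for `TrackedBody` (or for a disk-backed body) would not require any change to client code, which is the property the design relies on.
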
The existing DOM body classes that the handle classes used, though, were the specific DOM implementation classes, and not abstract base classes that represent the required interface that any implementation must meet. This meant that the DOM body classes were unsuitable for meeting the goal of having pluggable implementations, and thus for the new design. The IDOM, on the other hand, did have abstract base classes for each of the node types, which, along with the goal of having the IDOM implementation be the default implementation, made the IDOM abstract base classes ideal for the body base classes.

Assuming that the IDOM implementation was better suited to be the default implementation, I discarded the DOM implementation classes. If later desired, though, the DOM implementation classes could be adapted to derive from the new body base classes (the IDOM abstract base classes) and become an alternative implementation. Some informal testing that I have done found DOM to outperform IDOM in some circumstances (mainly with large documents), so this may actually be desirable.

Handles communicate to bodies that they are using them by calling addRef on the body when usage starts and removeRef when usage ends. These are virtual methods on the IDOM_Node abstract base class; they can be overridden and used by some implementations, and ignored by others.

Default IDOM implementation reference counting:

The new design aims to avoid drawing users into maintaining a specific implementation's memory model, as is currently done with the IDOM. To do this, the IDOM implementation must be modified to utilize reference counting. But wait, you say, wasn't reference counting one of the performance problems that the IDOM was designed to solve? Well, yes and no. Here is an excerpt from the IDOM user manual:

    The C++ IDOM implementation no longer uses reference counting for automatic memory management. The C++ IDOM uses an independent storage allocator per document.
    The storage for a DOM document is associated with the document node object. The advantage here is that allocation would require no synchronization in most cases (based on the same threading model that we have now - one thread active per document, but any number of documents running in parallel with separate threads). The allocator does not support a delete operation at all - all allocated memory would persist for the life of the document, and then the larger blocks would be returned to the system without separately deleting all of the individual nodes and strings within the document.

The performance benefit the IDOM provides is gained by use of a document-owned storage allocator, which does not require synchronization like the general heap manager does. Note that reference counting alone is not a problem. Think about it: compared to everything else that is done when using the DOM, simply avoiding incrementing and decrementing reference counters will have a negligible effect on performance.

As it turns out, reference counting is a useful component in solving one of IDOM's biggest problems, that of leaking memory. The problem is that the memory for any removed nodes is not released until the document is destroyed. The document's storage allocator needs to be updated to allow reclaiming the memory of removed nodes (by adding an overloaded operator delete to balance its overloaded operator new), and reference counting can make it easy to know when to call delete on nodes. The policy used by the original DOM will suffice for the new design:

A node body is no longer needed when:

1. No more handles are pointing to it.
2. It has no parent node. In other words, it is no longer part of a document.

The document node is treated specially and is no longer needed when:

1. No more handles are pointing to it.
2. None of its owned nodes have any handles pointing to them.

IDOM implementation changes:

1. Add an overloaded operator delete to balance the overloaded operator new provided by IDDocumentImp.
2. Add reference counters to IDNodeImp and IDDocumentImp.

-----Original Message-----
From: Tinny Ng [mailto:[EMAIL PROTECTED]]
Sent: Monday, April 15, 2002 2:48 PM
To: [EMAIL PROTECTED]
Subject: Re: DOM 3 Patches?

Jason,

> Are you willing to accept patches for DOM level 3 implementations

Yes, but one second ... Remember that some time ago I posted an Apache C++ DOM Binding proposal? I am now reviewing the comments and trying to reorganize the IDOM (e.g. renaming IDOM_DOMXXX to DOMXXX as discussed, which also matches the DOM L3 naming convention, e.g. DOMBuilder, DOMErrorHandler, etc.). Give me a few more days, and I will post the prototype to the mailing list for review. Then you can submit your patch based on this new prototype.

Tinny

----- Original Message -----
From: "Jason E. Stewart" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, April 15, 2002 1:30 PM
Subject: DOM 3 Patches?

> Hey Tinny et al.,
>
> Are you willing to accept patches for DOM level 3 implementations in
> IDOM? I'd really like to add support for the new 'encoding',
> 'version', and 'standalone' attributes of DOM_Document. That way I can
> handle XML declarations properly.
>
> What say you?
>
> Thanks,
> jas.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
