Here's the stuff I think we could get started with on the core development side:
1) storage:
===========

Currently XML is stored by "compressing" it, which in fact means flattening the tree into a byte array, replacing all element and attribute names with pointers into a symbol table, and then storing the byte array as one data record ("page", or linked list of pages), indexed by its name in the collection's paged file using a B-Tree. This makes for poor scalability, and it is not particularly interesting for query optimization either.

I'd like to propose the following storage scheme. We keep the tree structure of the XML as is, but store every node in its OWN data record in a paged file, and have elements POINT to their children: i.e. an element record contains the page numbers of the records holding its child nodes, plus its actual attributes (there is not all that much point in storing attributes separately, although this too is debatable...). A rough sketch of such a node record is appended at the end of this mail.

We can keep the symbol table without any problem. In fact, we could improve it by adding, to each symbol table entry, pointers to every element or attribute node that has that entry as its name, allowing fast execution of "//myelement" type XPaths (see the second sketch at the end).

The main goal here is to allow for large documents. If nodes are stored separately, they need not all be read into memory all the time: the Xindice DOM implementation would only load what it needs. We can keep the B-Tree in the same data file for locating the root nodes of documents by name.

Indexes could also benefit: instead of pointing to the complete document that satisfies an XPath, they could point to the actual node that is the result of the XPath. In this manner, XPath searches within a single large XML document could also benefit from indexes. The same goes for updates: selects are resolved as XPath queries (such is already the case!), and then only the required node(s) need to be updated.

2) failure recovery:
====================

This is a necessary prerequisite for transaction support. For this, it is necessary to maintain a log of all updates to the database. This log can be "checkpointed" at various times, and these checkpoints can be used by a (still-to-be-written) recovery engine that can roll back to the last checkpoint. That way, consistency can be guaranteed after a crash, for instance. There are a few good failure recovery algorithms to be found in any database theory book.

The issue is always choosing the "data unit": how granular should the log be? This depends essentially on how fine-grained updates are. With the storage scheme above, the unit would be nodes, which are in fact stored as records in a paged file. Choosing records in a paged file as the log unit has the advantage of protecting not only the XML trees, but also the B-Tree containing the document names. (A minimal sketch of such a log is appended at the end as well.)

With this groundwork done, we can start thinking about query optimization and transaction support.

It's one scheme, but I'm pretty sure it should work. It should at the very least be fun to have a go at ;) All other views are obviously welcome.

James
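
Sketch 1: per-node record. To make the storage idea a bit more concrete, here is a minimal Java sketch (all names are made up, none of this exists in the Xindice code base) of what one element record could look like: it keeps only the symbol table id of its name, its attributes inline, and the page numbers of its children, so the DOM layer can fetch children lazily.

    import java.io.*;
    import java.util.*;

    // Hypothetical per-node data record for the proposed storage scheme.
    class ElementRecord {
        int nameSymbolId;                 // index into the collection's symbol table
        long[] childPages;                // page numbers of the records holding the child nodes
        Map<Integer, String> attributes;  // attribute name symbol id -> value, stored inline

        ElementRecord(int nameSymbolId, long[] childPages, Map<Integer, String> attributes) {
            this.nameSymbolId = nameSymbolId;
            this.childPages = childPages;
            this.attributes = attributes;
        }

        // Serialize into the byte payload of one page (or a linked list of pages).
        byte[] toBytes() throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(nameSymbolId);
            out.writeInt(childPages.length);
            for (long page : childPages) {
                out.writeLong(page);
            }
            out.writeInt(attributes.size());
            for (Map.Entry<Integer, String> a : attributes.entrySet()) {
                out.writeInt(a.getKey());
                out.writeUTF(a.getValue());
            }
            return bos.toByteArray();
        }
    }

An index entry could then simply hold the page number of the node that satisfies the XPath, rather than the name of the whole document.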
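
Sketch 2: symbol table with occurrence pointers. Again purely illustrative: besides mapping names to ids, each entry also keeps the page numbers of every node record that uses that name, so a "//myelement" query can be answered without walking the whole tree.

    import java.util.*;

    // Hypothetical in-memory view of the improved symbol table.
    class SymbolTable {
        private final Map<String, Integer> idsByName = new HashMap<>();
        private final List<String> namesById = new ArrayList<>();
        private final List<List<Long>> pagesBySymbol = new ArrayList<>();

        // Return the id for a name, creating a new entry if needed.
        int intern(String name) {
            Integer id = idsByName.get(name);
            if (id == null) {
                id = namesById.size();
                idsByName.put(name, id);
                namesById.add(name);
                pagesBySymbol.add(new ArrayList<Long>());
            }
            return id;
        }

        // Called whenever a node record with this name is written to some page.
        void addOccurrence(int symbolId, long page) {
            pagesBySymbol.get(symbolId).add(page);
        }

        // All node records whose name matches "//name".
        List<Long> lookup(String name) {
            Integer id = idsByName.get(name);
            return id == null ? Collections.<Long>emptyList() : pagesBySymbol.get(id);
        }
    }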
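
Sketch 3: record-level undo log with checkpoints. This only shows the bookkeeping; a real log would of course have to live on disk and be flushed before the record itself is overwritten. Each entry stores the before-image of one paged-file record (XML node or B-Tree page), which is what lets the recovery engine roll the file back to the last checkpoint.

    import java.util.*;

    // Hypothetical undo log over paged-file records.
    class UpdateLog {
        // Minimal stand-in for the paged file interface assumed by this sketch.
        interface PagedFile {
            void writeRecord(long page, byte[] data);
        }

        static class Entry {
            final long page;          // record that is about to be overwritten
            final byte[] beforeImage; // its contents before the update
            Entry(long page, byte[] beforeImage) {
                this.page = page;
                this.beforeImage = beforeImage;
            }
        }

        private final Deque<Entry> entries = new ArrayDeque<>();

        // Must be recorded (and flushed) BEFORE the record itself is overwritten.
        void logUpdate(long page, byte[] beforeImage) {
            entries.push(new Entry(page, beforeImage));
        }

        // Declare the current on-disk state consistent; older entries are no longer needed.
        void checkpoint() {
            entries.clear();
        }

        // Crash recovery: undo everything since the last checkpoint, newest entry first.
        void rollback(PagedFile file) {
            while (!entries.isEmpty()) {
                Entry e = entries.pop();
                file.writeRecord(e.page, e.beforeImage);
            }
        }
    }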