If you store every node as its own record, you will run into locality issues when reconstructing documents: there is no guarantee that all the nodes of a particular document will be located in the same area of the disk.
If you do in fact want to store each node separately, you will need to create some sort of container for all the nodes in a document, so that the disk can access the container as one logical data structure.

Matt Liotta
President & CEO
Montara Software, Inc.
http://www.montarasoftware.com/
888-408-0900 x901

> -----Original Message-----
> From: James Bates [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 23, 2002 6:44 PM
> To: [EMAIL PROTECTED]
> Subject: core development
>
> Here's the stuff I think we could get started with in the core
> development field:
>
> 1) storage:
> =======
> Currently XML is stored by "compressing" it, which in fact involves
> flattening the tree into a byte array, replacing all element and
> attribute names with pointers into a symbol table, and then storing
> the byte array as one data record (a "page", or a linked list of
> pages), indexed by its name in the collection paged file using a
> B-Tree.
>
> This makes for poor scalability, and it is also not particularly
> interesting for optimizing queries. I'd like to propose the following
> storage scheme:
>
> We could leave the tree structure of the XML as is. We could then
> store every node in its OWN data record in a paged file, and have
> elements POINT to their children (i.e. they'd contain the page numbers
> of the records containing their child nodes, plus the actual
> attributes; there is not much point in storing attributes separately,
> although this too is debatable). We can maintain the symbol table
> without a problem. In fact we could improve it by adding, to each
> symbol table entry, pointers to every element or attribute node that
> uses that entry as its name, allowing fast execution of
> "//myelement"-style XPaths.
>
> The main goal here is to allow for large documents. If nodes are
> stored separately, they need not all be read into memory all the time.
> The Xindice DOM implementation would only load what it needs.
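[The node-per-record scheme proposed above can be sketched as follows. This is an illustrative Python sketch, not Xindice's actual Java implementation; every class and function name here is invented for the example. It shows nodes stored as individual records in a toy paged file, element records pointing at their children by record ("page") id, and a symbol table augmented with back-pointers to support fast "//myelement"-style lookups.]

```python
# Sketch (assumed design, not Xindice code): each node is its own record
# in a paged file; element records hold the page ids of their children,
# and element names are interned in a shared symbol table that also
# tracks which records use each name.

class SymbolTable:
    def __init__(self):
        self.names = {}    # element name -> symbol id
        self.records = {}  # symbol id -> set of record ids using that name

    def intern(self, name, record_id):
        sid = self.names.setdefault(name, len(self.names))
        self.records.setdefault(sid, set()).add(record_id)
        return sid

class PagedFile:
    """Toy paged file: one dict entry stands in for one record/page."""
    def __init__(self):
        self.pages = {}
        self.next_id = 0

    def write(self, record):
        rid = self.next_id
        self.next_id += 1
        self.pages[rid] = record
        return rid

def store_element(pf, symbols, name, attributes, children):
    """Store children first, then the element record pointing at them.

    `children` is a list of (name, attributes, children) triples.
    """
    child_ids = [store_element(pf, symbols, *c) for c in children]
    rid = pf.next_id  # the id the record below will receive
    record = {
        "name": symbols.intern(name, rid),
        "attrs": attributes,   # attributes kept inline with their element
        "children": child_ids, # page ids of child records
    }
    return pf.write(record)
```

[A "//myelement" query then never walks the tree: `symbols.records[symbols.names["myelement"]]` yields the record ids of every element with that name directly, which is the fast-XPath benefit the proposal describes.]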
>
> We can maintain the B-Tree in the same data file for locating the root
> nodes of documents, based on their names.
>
> Indexes could also benefit: instead of pointing to the complete
> document that satisfies an XPath, they could point to the actual node
> that is the result of the XPath. In this manner, XPath searches within
> a single large XML document could also benefit from indexes.
>
> The same goes for updates: the selection can be resolved as an XPath
> query (as is already the case), and then only the required node(s)
> need be updated.
>
> 2) failure recovery:
> ============
> This is a necessary prerequisite for transaction support.
>
> For this, it is necessary to maintain a log of all updates to the
> database. This log can be "checkpointed" at various times, and these
> checkpoints can be used by a (still-to-be-written) recovery engine
> that can roll back to the last checkpoint. This way, consistency can
> be guaranteed, for instance.
>
> There are a few good failure-recovery algorithms to be found in any
> database theory book. The issue is always choosing the "data unit":
> how granular should the log be? This depends essentially on how
> fine-grained updates are. Under the storage scheme above, the unit
> would be nodes, which are in fact stored as records in a paged file.
> Choosing paged-file records as the log unit has the advantage of
> protecting not only the XML trees, but also the B-Tree containing the
> document names.
>
> With this groundwork done, we can start thinking about query
> optimization and transaction support.
>
> It's one scheme, but I'm pretty sure it should work. It should at the
> very least be fun to have a go at ;)
>
> All other views are obviously welcome.
>
> James
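[The recovery scheme sketched in section 2 can be illustrated with a minimal undo log at paged-file-record granularity. This is a hedged sketch under assumed names (`RecoverableStore`, `update`, `checkpoint`, `recover` are all invented for the example), not the recovery engine the mail proposes to write; it only shows the mechanism: every update first logs the record's before-image, a checkpoint truncates the log, and recovery undoes logged updates in reverse until the last checkpoint's state is restored.]

```python
# Sketch (assumed design): an undo log over paged-file records. Each
# update appends (record_id, before_image) to the log; recover() rolls
# the store back to the state at the most recent checkpoint.

class RecoverableStore:
    def __init__(self):
        self.records = {}  # record id -> current value
        self.log = []      # (record id, before-image) pairs since checkpoint

    def update(self, rid, value):
        # Log the prior image BEFORE applying the change (write-ahead).
        self.log.append((rid, self.records.get(rid)))
        self.records[rid] = value

    def checkpoint(self):
        # State up to this point is considered durable; discard undo info.
        self.log.clear()

    def recover(self):
        # Undo updates in reverse order back to the last checkpoint.
        while self.log:
            rid, before = self.log.pop()
            if before is None:
                self.records.pop(rid, None)  # record did not exist before
            else:
                self.records[rid] = before
```

[Because the log unit is a paged-file record rather than an XML node, the same mechanism covers B-Tree pages too, which is exactly the advantage the mail points out.]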