On Monday, December 23, 2002, at 04:44 PM, James Bates wrote:

Here's the stuff I think we could get started with in the core development
field:


1) storage:
=======
Currently XML is stored by "compressing" it, which in fact involves
flattening the tree into a byte array, replacing all element and attribute
names with pointers into a symbol table, and then storing the byte array as
one data record (a "page", or a linked list of pages) in the collection's
paged file, indexed by its name using a B-Tree.
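
Roughly, the effect is something like this (an illustrative sketch, not the
actual Xindice format or classes):

    // Illustrative sketch of symbol-table compression: element and attribute
    // names become small integer IDs, and the whole flattened tree is
    // serialized as a single byte-array record.
    import java.util.HashMap;
    import java.util.Map;

    class SymbolTable {
        private final Map<String, Integer> ids = new HashMap<>();

        int intern(String name) {
            // Assign each distinct name a stable small integer ID.
            return ids.computeIfAbsent(name, n -> ids.size());
        }
    }
    // e.g. <address><street>...</street></address> serializes as
    // [id("address"), id("street"), text bytes, ...] in one record.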


This makes for poor scalability, and it also offers little leverage for
query optimization. I'd like to propose the following storage scheme:


We could leave the tree structure of the XML as is. We could then store
every node in its OWN data record in a paged file, and have elements POINT
to their children: each element record would contain the page numbers of the
records holding its child nodes, plus its actual attributes (there isn't
much point in storing attributes separately, although this too is
debatable...). We can maintain the symbol table with no problem. In fact we
could improve it by adding, to each symbol table entry, pointers to every
element or attribute node carrying that name, allowing fast execution of
"//myelement"-style XPaths.
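
For illustration, a per-node record might look something like this (all
names here are hypothetical, not existing Xindice classes):

    // Hypothetical per-node record: children are referenced by page number
    // and fetched lazily, so a DOM built on top only reads the pages it
    // actually touches. Attributes stay inline with their element.
    import java.util.ArrayList;
    import java.util.List;

    class NodeRecord {
        long pageNumber;          // where this record lives in the paged file
        int nameSymbol;           // index into the collection's symbol table
        List<Long> childPages = new ArrayList<>();   // pages of child records
        List<int[]> attributes = new ArrayList<>();  // {nameSymbol, valueSymbol} pairs
    }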


The main goal here is to allow for large documents. If nodes are stored
separately, they need not all be read into memory all the time; the Xindice
DOM implementation would load only what it needs.

I completely agree that we need to break the documents up more, and I also agree with your reasons for needing to do so; however, I'm very hesitant to go all the way to individual nodes. There are several reasons for this. One is that the current BTree is really slow; in fact it's faster to store things using the FSFiler than it is with the BTreeFiler. If we start breaking documents up at that granularity, performance is really going to go through the floor. This is especially true if we store one node per page: a typical node will be only a few bytes to a few hundred bytes, but pages are supposed to be the same size as the OS page size (usually 4 or 8K). So we're talking about having to read large numbers of pages for relatively small documents.
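
To put rough illustrative numbers on that (my own figures, purely for scale): a 1 MB document with an average node size of 100 bytes has about 10,000 nodes. At one node per 4K page that's 10,000 page reads and roughly 40 MB of disk for 1 MB of XML, a 40x blowup, where the current scheme would read the same document in a few hundred pages.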


Anyway, I've been exploring two things related to this. First, I've been looking hard at page management and the idea of storing one node per page. For the current BTree this is OK because, the way it is implemented, it will completely fill the pages for index (i.e. internal) nodes before splitting, and most data nodes at the leaves are relatively large. If we go to smaller leaf node sizes, we need to look at storing more than one BTree node per page. This is probably a good thing to do anyway, since indexes already store small leaf nodes. BTW, I also think we should be questioning the use of a BTree for actual record storage. It makes sense for indexes, but for direct key-based retrievals it may not be the best thing.
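
The usual trick here is some kind of slotted page, where several small records share one OS-sized page and are addressed by (page, slot) instead of page number alone. A minimal sketch, assuming nothing about the current Paged implementation:

    // Hypothetical slotted-page sketch: many small BTree nodes share one
    // OS-sized page instead of each burning a full page.
    import java.util.ArrayList;
    import java.util.List;

    class SlottedPage {
        static final int PAGE_SIZE = 4096;   // match the OS page size
        final byte[] data = new byte[PAGE_SIZE];
        final List<Integer> slotOffsets = new ArrayList<>(); // offset of each record
        int freeOffset = 0;                  // start of free space

        // Returns the slot index, or -1 if the record doesn't fit.
        int insert(byte[] record) {
            if (freeOffset + record.length > PAGE_SIZE) return -1;
            System.arraycopy(record, 0, data, freeOffset, record.length);
            slotOffsets.add(freeOffset);
            freeOffset += record.length;
            return slotOffsets.size() - 1;
        }
    }

Deletes and compaction would complicate this, but the basic win is that lots of small nodes stop costing a full page read each.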

The second thing I'm exploring in this area is a different way to partition an XML document into pieces. I've been exploring horizontal partitioning, where all sibling nodes would be stored together in a single record. In a lot of ways this is pretty similar to how relational databases work: each set of sibling nodes represents a tuple within a relation, and the entire document is made up through joins between relations. In our case each relation will only have one tuple, but I'm thinking the general concept is the same. Relational databases don't store things at the attribute level; they store full tuples together as a unit. We'd still use the compression format at the sibling level and just extend it so that each node can find the proper BTree node containing its children. Anyway, I've done a fair amount of exploratory work on this, including some ideas for structural and content-based indexes, but I need more time before I'm willing to say whether it's a good idea or not. So far I do kind of like it.
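
To make that concrete, a sibling-group record might look roughly like this (hypothetical names, and I'm still assuming the existing compressed format for the node content itself):

    // Hypothetical horizontal partitioning: all siblings under one parent
    // are serialized together as a single record, roughly analogous to a
    // tuple in a relational table. Child groups are reached via BTree keys.
    import java.util.ArrayList;
    import java.util.List;

    class SiblingGroup {
        long parentKey;                 // BTree key of the parent's group
        List<Node> siblings = new ArrayList<>();

        static class Node {
            int nameSymbol;             // symbol table ID, as in the compressed format
            byte[] content;             // compressed node content, attributes inline
            long childGroupKey;         // BTree key of this node's own
                                        // SiblingGroup, or -1 for a leaf
        }
    }

Resolving a path like /a/b/c then walks one sibling group per path step, a join-like lookup per level rather than one record read per node.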

So far I also see two cases where it will break down: one is where you have a large, flat document, and the other is where you have a deep hierarchical document with one child per node. In the first case you get a really large list of sibling nodes, and in the second you get a deep hierarchy that is basically the same as storing all nodes individually.



We can keep the B-Tree in the same data file for locating the root nodes of
documents by name.


Indexes could also benefit: instead of pointing to the complete document
that satisfies an XPath, they could point to the actual node that is the
result of the XPath. In this manner, XPath searches in a single large XML
document could also benefit from indexes.
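
For illustration, an index entry at node granularity would just pair the indexed value with a node address rather than only a document key (a hypothetical sketch):

    // Hypothetical node-level index entry: instead of mapping a key value
    // to a whole document, it maps to the exact node, so an XPath like
    // //item[@id='42'] on a large document can seek straight to the match.
    class IndexEntry {
        byte[] keyValue;      // the indexed value, e.g. an attribute's text
        long documentKey;     // which document the match lives in
        long nodePage;        // page (or page + slot) of the matching node record
    }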


Same story for updates: selects can be resolved as XPath queries (as is
already the case!), and then only the required node(s) need be updated.


2) failure recovery:
============
This is a necessary prerequisite for transaction support.

For this, it is necessary to maintain a log of all updates to the database.
This log can be "checkpointed" at various times, and these checkpoints can
be used by a (still-to-be-written) recovery engine that can "roll back" to
the last checkpoint. This way, consistency can be guaranteed after a crash,
for instance.


There are a few good failure recovery algorithms to be found in any database
theory book. The issue is always choosing the "data unit": how granular
should the log be? This depends essentially on how fine-grained updates are.
Using the storage scheme above, this would be nodes, which are in fact
stored as records in a paged file. Choosing records in a paged file as the
log unit gives the advantage of protecting not only the XML trees, but also
the B-Tree containing the document names.
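
With paged-file records as the unit, each log entry is essentially a before/after image of one record, which supports both undo back to a checkpoint and redo of committed work. A minimal sketch (the names are mine, not an existing design):

    // Hypothetical write-ahead log entry at paged-file record granularity.
    // Undo uses beforeImage to roll back to the last checkpoint; redo uses
    // afterImage to replay committed updates after a crash.
    class LogRecord {
        long lsn;             // log sequence number, monotonically increasing
        long txnId;           // transaction this update belongs to
        String fileName;      // which paged file (document store, B-Tree, index)
        long pageNumber;      // which record was touched
        byte[] beforeImage;   // record contents before the update (for undo)
        byte[] afterImage;    // record contents after the update (for redo)
    }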

Related to this, I've been exploring a versioning technique for implementing transactions. This is described very briefly in the book you mentioned, "Fundamentals of Database Systems", and there are a number of research papers published on it. This is probably a really good technique from our perspective, since we want versioning in the core anyway. We would of course still need a recovery log, but I think using a versioning technique might allow us to track smaller amounts of information in it: i.e. we could write the actual data values to the repository as a new version without updating the version history for the node, and then just record the version history change in the log before committing. Anyway, just wanted to give you some idea of what I'm exploring; there's still a ton to learn here. I picked up a book, "Recovery Mechanisms in Database Systems", from the library a few days ago. I haven't really gotten into it yet, but I'll let you know if it's worth reading.
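
For what it's worth, here's a rough sketch of the version-chain idea as I understand it so far (hypothetical, and ignoring uncommitted versions and garbage collection):

    // Hypothetical version chain for a record: writers append a new version
    // rather than overwriting in place; a reader picks the newest version
    // whose commit timestamp precedes its transaction's start.
    import java.util.ArrayList;
    import java.util.List;

    class VersionedRecord {
        long recordKey;
        final List<Version> versions = new ArrayList<>(); // oldest to newest

        static class Version {
            long commitTimestamp;   // when the writing transaction committed
            byte[] data;            // the record contents for this version
        }

        // Snapshot read: latest version committed before txnStart.
        byte[] readAsOf(long txnStart) {
            byte[] result = null;
            for (Version v : versions) {
                if (v.commitTimestamp <= txnStart) result = v.data;
            }
            return result;
        }
    }

The appeal for logging is what I described above: the bulky data goes into the repository as a new version, and only the small version-history change has to hit the log before commit.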


How comfortable are you with the recovery algorithms you mentioned?

BTW, is the book on database implementation you mentioned worth picking up? I've been looking at that book for a long time and have always balked at the price. I've been trying to find a copy locally to skim, but neither the UofA library nor any bookstores around here have it.



With this groundwork done, we can start thinking about query optimization
and transaction support.

It's one scheme, but I'm pretty sure it should work. It should at the very
least be fun to have a go at ;)


All other views are obviously welcome.

James



Kimbro Staken
Java and XML Software, Consulting and Writing http://www.xmldatabases.org/
Apache Xindice native XML database http://xml.apache.org/xindice
XML:DB Initiative http://www.xmldb.org



