If you store every node as its own record, you will run into locality issues when reconstructing documents: there is no guarantee that all the nodes of a particular document will be located in the same area of the disk.
If you do in fact want to store each node separately, you will need to create some sort of container for all the nodes in a document, so that the disk can access the container as one logical data structure.

Matt Liotta
President & CEO
Montara Software, Inc.
http://www.montarasoftware.com/
888-408-0900 x901

> -----Original Message-----
> From: James Bates [mailto:[EMAIL PROTECTED]
> Sent: Monday, December 23, 2002 6:44 PM
> To: [EMAIL PROTECTED]
> Subject: core development
>
> Here's the stuff I think we could get started with in the core
> development field:
>
> 1) storage:
> =======
> Currently XML is stored by "compressing" it, which in fact involves
> flattening the tree into a byte array, replacing all element and
> attribute names with pointers into a symbol table, and then storing
> the byte array as one data record (a "page", or a linked list of
> pages), indexed by its name in the collection paged file using a
> B-Tree.
>
> This makes for poor scalability, and it is also not particularly
> interesting for optimizing queries. I'd like to propose the following
> storage scheme:
>
> We could leave the tree structure of the XML as is. We could then
> store every node in its OWN data record in a paged file, and have
> elements POINT to their children (i.e. they'd contain the page numbers
> of the records containing their child nodes, plus the actual
> attributes; there is not much point in storing attributes separately,
> although this too is debatable). We can maintain the symbol table
> without a problem. In fact we could improve it by adding, to each
> symbol table entry, pointers to every element or attribute node that
> uses that entry as its name, allowing fast execution of
> "//myelement"-style XPaths.
>
> The main goal here is to allow for large documents. If nodes are
> stored separately, they need not all be read into memory all the time.
> The Xindice DOM implementation would only load what it needs.
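[The node-per-record scheme proposed above can be sketched as follows. This is an illustrative Python sketch, not Xindice's actual Java implementation; every class and function name here is invented for the example. It shows nodes stored as individual records in a toy paged file, element records pointing at their children by record ("page") id, and a symbol table augmented with back-pointers to support fast "//myelement"-style lookups.]

```python
# Sketch (assumed design, not Xindice code): each node is its own record
# in a paged file; element records hold the page ids of their children,
# and element names are interned in a shared symbol table that also
# tracks which records use each name.

class SymbolTable:
    def __init__(self):
        self.names = {}    # element name -> symbol id
        self.records = {}  # symbol id -> set of record ids using that name

    def intern(self, name, record_id):
        sid = self.names.setdefault(name, len(self.names))
        self.records.setdefault(sid, set()).add(record_id)
        return sid

class PagedFile:
    """Toy paged file: one dict entry stands in for one record/page."""
    def __init__(self):
        self.pages = {}
        self.next_id = 0

    def write(self, record):
        rid = self.next_id
        self.next_id += 1
        self.pages[rid] = record
        return rid

def store_element(pf, symbols, name, attributes, children):
    """Store children first, then the element record pointing at them.

    `children` is a list of (name, attributes, children) triples.
    """
    child_ids = [store_element(pf, symbols, *c) for c in children]
    rid = pf.next_id  # the id the record below will receive
    record = {
        "name": symbols.intern(name, rid),
        "attrs": attributes,   # attributes kept inline with their element
        "children": child_ids, # page ids of child records
    }
    return pf.write(record)
```

[A "//myelement" query then never walks the tree: `symbols.records[symbols.names["myelement"]]` yields the record ids of every element with that name directly, which is the fast-XPath benefit the proposal describes.]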
>
> We can maintain the B-Tree in the same data file for locating the root
> nodes of documents, based on their names.
>
> Indexes could also benefit: instead of pointing to the complete
> document that satisfies an XPath, they could point to the actual node
> that is the result of the XPath. In this manner, XPath searches within
> a single large XML document could also benefit from indexes.
>
> The same goes for updates: the selection can be resolved as an XPath
> query (as is already the case), and then only the required node(s)
> need be updated.
>
> 2) failure recovery:
> ============
> This is a necessary prerequisite for transaction support.
>
> For this, it is necessary to maintain a log of all updates to the
> database. This log can be "checkpointed" at various times, and these
> checkpoints can be used by a (still-to-be-written) recovery engine
> that can roll back to the last checkpoint. This way, consistency can
> be guaranteed, for instance.
>
> There are a few good failure-recovery algorithms to be found in any
> database theory book. The issue is always choosing the "data unit":
> how granular should the log be? This depends essentially on how
> fine-grained updates are. Under the storage scheme above, the unit
> would be nodes, which are in fact stored as records in a paged file.
> Choosing paged-file records as the log unit has the advantage of
> protecting not only the XML trees, but also the B-Tree containing the
> document names.
>
> With this groundwork done, we can start thinking about query
> optimization and transaction support.
>
> It's one scheme, but I'm pretty sure it should work. It should at the
> very least be fun to have a go at ;)
>
> All other views are obviously welcome.
>
> James
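[The recovery scheme sketched in section 2 can be illustrated with a minimal undo log at paged-file-record granularity. This is a hedged sketch under assumed names (`RecoverableStore`, `update`, `checkpoint`, `recover` are all invented for the example), not the recovery engine the mail proposes to write; it only shows the mechanism: every update first logs the record's before-image, a checkpoint truncates the log, and recovery undoes logged updates in reverse until the last checkpoint's state is restored.]

```python
# Sketch (assumed design): an undo log over paged-file records. Each
# update appends (record_id, before_image) to the log; recover() rolls
# the store back to the state at the most recent checkpoint.

class RecoverableStore:
    def __init__(self):
        self.records = {}  # record id -> current value
        self.log = []      # (record id, before-image) pairs since checkpoint

    def update(self, rid, value):
        # Log the prior image BEFORE applying the change (write-ahead).
        self.log.append((rid, self.records.get(rid)))
        self.records[rid] = value

    def checkpoint(self):
        # State up to this point is considered durable; discard undo info.
        self.log.clear()

    def recover(self):
        # Undo updates in reverse order back to the last checkpoint.
        while self.log:
            rid, before = self.log.pop()
            if before is None:
                self.records.pop(rid, None)  # record did not exist before
            else:
                self.records[rid] = before
```

[Because the log unit is a paged-file record rather than an XML node, the same mechanism covers B-Tree pages too, which is exactly the advantage the mail points out.]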