jbates 2002/12/03 06:21:20

Modified:    src/documentation/content/xdocs/dev guide-internals.xml
Added:       src/documentation/resources/images element.png element.xcf
Log:
Started Compressed DOM chapter in Internals Guide

Revision  Changes    Path
1.3       +96 -1     xml-xindice/src/documentation/content/xdocs/dev/guide-internals.xml

Index: guide-internals.xml
===================================================================
RCS file: /home/cvs/xml-xindice/src/documentation/content/xdocs/dev/guide-internals.xml,v
retrieving revision 1.2
retrieving revision 1.3
diff -u -r1.2 -r1.3
--- guide-internals.xml 26 Nov 2002 09:20:42 -0000 1.2
+++ guide-internals.xml 3 Dec 2002 14:21:20 -0000 1.3
@@ -286,6 +286,12 @@
         further, the start of the page's data.</p>
         <section>
           <title>3.1.1. Paged file header</title>
+          <p>The paged file header consists of a number of fixed-length fields.
+          Fields which are longer than one byte are <em>always</em> stored in
+          Big Endian format, which means the most significant byte is written at the
+          lowest address. This is regardless of the type of architecture the server
+          process is running on, so your data files are portable between
+          architectures.</p>
           <figure src="/images/pagedfilehdr.png" alt="File header structure"/>
           <p>The meaning of the various fields in the file header, whose structure
           is shown above, is as follows:</p>
@@ -349,11 +355,100 @@
     </section>
     <section>
       <title>4. XML storage</title>
+      <p>As we saw in the preceding chapter, the B+-Tree file format allows for the
+      efficient storage of (name, value) pairs. In this chapter we concern ourselves
+      with using such a (name,value) storage facility to store the XML content of all
+      XML documents in a collection.</p>
+      <p>The principle Xindice uses is deceptively simple here: for every XML <em>document</em>,
+      Xindice will calculate something called the <em>compressed DOM</em>. This is an array of bytes
+      which can be used to reconstruct the complete XML document at any time. An XML document is
+      then stored as a (name,value) pair in the B+-Tree, where the name is the name given to the XML document,
+      and the value is the calculated Compressed DOM.</p>
+      <p>The remaining mechanism to investigate is thus how to construct the Compressed DOM
+      of a document.</p>
       <section>
         <title>4.1. The symbol tables</title>
+        <p>In order to store the XML content in a space-efficient manner, Xindice uses
+        something called a <em>Symbol table</em>. This is an XML file which associates
+        a 16-bit number with any (QName,namespace URI) pair used as element or attribute name
+        in <em>all</em> XML files stored in a collection (i.e. there is <em>one</em>
+        symbol table per collection).</p>
+        <p>Consider the following XML document, to be added to a Xindice collection:</p>
+<source><![CDATA[
+<?xml version="1.0"?>
+<p:person xmlns:p="http://www.xindice.org/Examples/PersonData"
+          gender="female"
+          xml:lang="fr">
+  <p:first-name>Susanne</p:first-name>
+  <p:last-name>Carpentier</p:last-name>
+  <p:e-mail active="yes">[EMAIL PROTECTED]</p:e-mail>
+</p:person>
+]]></source>
+        <p>When this document is stored into an empty Xindice collection, the following
+        symbol table is created:</p>
+<source><![CDATA[
+<?xml version="1.0"?>
+<?xindice-class org.apache.xindice.xml.SymbolTable?>
+<symbols>
+  <symbol name="p:first-name" nsuri="http://www.xindice.org/Examples/PersonData" id="4" />
+  <symbol name="p:e-mail" nsuri="http://www.xindice.org/Examples/PersonData" id="6" />
+  <symbol name="p:last-name" nsuri="http://www.xindice.org/Examples/PersonData" id="5" />
+  <symbol name="gender" id="2" />
+  <symbol name="xml:lang" id="3" />
+  <symbol name="p:person" nsuri="http://www.xindice.org/Examples/PersonData" id="0" />
+  <symbol name="active" id="7" />
+  <symbol name="xmlns:p" nsuri="http://www.w3.org/2000/xmlns/" id="1" />
+</symbols>
+]]></source>
+        <p>As you can see, the symbol table is itself an XML document which contains
+        an element for every (QName, namespace URI) pair used in element and attribute
+        names in the XML documents of the collection. The <code>id</code> attribute is
+        the 16-bit number that Xindice has assigned to the (QName, namespace URI) pair.</p>
+        <p>As more documents are added to the
+        collection using different element and attribute names, entries are added to the
+        collection's symbol table.</p>
+        <p>Usually, a collection's symbol table is stored as any other XML document in
+        the Xindice database. All symbol tables stored in Xindice are in the
+        <code>system/SysSymbols</code> collection using as name the path of the collection,
+        with underscores (_) substituted for the /'s in the collection path.</p>
+        <p>Being a collection in Xindice, <code>system/SysSymbols</code> itself has
+        a symbol table too. It is:</p>
+<source><![CDATA[
+<symbols>
+  <symbol name="symbols" id="0" />
+  <symbol name="symbol" id="1" />
+  <symbol name="name" id="2" />
+  <symbol name="id" id="3" />
+  <symbol name="nsuri" id="4" />
+</symbols>
+]]></source>
+        <p>Normally, this symbol table should be stored in an XML document named
+        <code>system_SysSymbols</code> in the <code>system/SysSymbols</code>
+        collection. Doing so, however, would create an endless loop, as
+        <code>system/SysSymbols</code>'s symbol table is needed to read itself!
+        This particular symbol table is therefore hardcoded into the Xindice
+        source code.</p>
+        <p>For any other collection, you can always request the symbol table
+        yourself by issuing the Xindice command-line invocation:</p>
+<source>xindice rd -c /db/system/SysSymbols -n [your_collection_path]</source>
       </section>
       <section>
         <title>4.2. The Compressed DOM</title>
+        <p>Now that we understand symbol tables, we can take a look at the way in
+        which Xindice generates a byte string from any given XML document.</p>
+        <p>The trick is to understand that Xindice simply runs through the XML document
+        recursively, building a byte sequence for a particular node in the tree
+        representation of the XML. This will contain the byte data for the children
+        of the node, and these sub-sequences contain the data for their children, and so on.</p>
+        <p>Xindice thus starts by generating the byte sequence for the document node, which
+        will set off generation for the whole XML document.</p>
+        <section>
+          <title>4.2.1. Element nodes</title>
+          <p>An element node is encoded as shown in the diagram below:</p>
+          <figure src="images/element.png" alt="Element compressed DOM format"/>
+        </section>
+
+      </section>
     </section>
     <section>

1.1                  xml-xindice/src/documentation/resources/images/element.png

    <<Binary file>>

1.1                  xml-xindice/src/documentation/resources/images/element.xcf

    <<Binary file>>
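
Two of the mechanical rules documented in this patch can be illustrated with a small Java sketch: storing multi-byte fields most significant byte first, and deriving the name of a collection's symbol table document by replacing the slashes in the collection path with underscores. This is only an illustration under stated assumptions. The class and method names below are hypothetical and are not taken from the Xindice source, the 4-byte field width is chosen arbitrarily, and the path is assumed to be given in the same form as the `system/SysSymbols` example in the text.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class; not part of Xindice.
class InternalsSketch {

    // Encode a 32-bit field in big-endian order: the most significant
    // byte ends up at the lowest index, whatever the host architecture.
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4)
                .order(ByteOrder.BIG_ENDIAN)
                .putInt(value)
                .array();
    }

    // Decode a 32-bit big-endian field back into an int.
    static int decodeInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes)
                .order(ByteOrder.BIG_ENDIAN)
                .getInt();
    }

    // Assumed naming rule, following the text: the symbol table document
    // is named after the collection path, with '/' replaced by '_'.
    static String symbolTableName(String collectionPath) {
        return collectionPath.replace('/', '_');
    }

    public static void main(String[] args) {
        byte[] encoded = encodeInt(0x0A0B0C0D);
        // The first byte is the most significant one (0x0A).
        System.out.printf("%02X %02X %02X %02X%n",
                encoded[0], encoded[1], encoded[2], encoded[3]);
        System.out.println(decodeInt(encoded) == 0x0A0B0C0D);
        System.out.println(symbolTableName("system/SysSymbols"));
    }
}
```

Setting the byte order explicitly is redundant in Java, where `ByteBuffer` defaults to big endian, but it makes the intent visible to readers coming from other platforms.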