Re: B+-Tree indexing

Andy Seaborne Wed, 24 May 2017 01:53:43 -0700

On 24/05/17 02:33, Dennis Grinwald wrote:
> Hi there,


Hi Dennis,

> I'm very new to the Jena TDB and have several questions about it's

> Architecture. I would be very thankful if you could answer thefollowing :


All Nodes get given a NodeId which is a fixed length 8 byte sequence.

Some Nodes (RDF Terms) like integers, dateTimes, hold the value in theNodeId itself (8 bits types, 56 bits of value) otherwise the NodeIdrecords the location in the node table for the details of the node arerecorded (URIs, strings, out of range integers...).


== Lookup NodeId -> Node

The NodeId is the location in the node table so TDB just fetechs it fromthe file.

There is a large cache as well. This lookup happens on query results soit is important.


== Lookup Node -> NodeId

For this, the Node is hashed (MD5 : 128 bits) and the hash assumed to beunique. 128 bits is a very large number!


There is then a B+Tree with key node hash and value NodeId.

This lookup happens when building the query. It's cached and there is aa "miss" cache as well for nodes not in the database.


== SPO etc

The key is 3 NodeIds - 24 bytes.  There is no value part.

Because the B+Tree has the keys sorted, you can do lookup of S?? to findthe start of all triples with that S (lookup S00), then a range scanforwards until (S+1)00 through the B+Tree leaves. The leaves arechained horizontally. (This is for TDB1 ; not true in TDB2 - the leavesare not linked.)


Similarly, you can lookup SP?, or SPO.

The B+Trees are about order 200 - they use a 8Kbyte block.

1. In "Clustered TDB: A Clustered Triple Store for Jena" published byMr. Seaborn, Owens, Gibbins and Schraefel, it says that the Indexes:SPO, POS, OSP are B+-Trees. My question is: How are the internal nodes(keys) of this B+-Tree represented and what are the values/valuepointers in the leaves of the B+-Tree?

That was using a much older version of TDB but the design is the roughlythe same.

2. On "http://jena.apache.org/documentation/tdb/architecture.html"; itsays, that the Node to NodeID mapping is realized by a B+-Tree. Myquestion as in "1." : How are the internal nodes (keys) of this B+-Treerepresented and what are the values/value pointers in the leaves of theB+-Tree?

That lookup does not need the sorted property of B+Trees, just a akey->value lookup such as a external hash index. But the cache is soimportant anyway, it was simpler to have one datastructure.

It is quite important in TDB that it uses it's own implementation ofB+Trees because they are efficient (in space and time) for the TDB case.A general purpose B+Tree storage package is doing a lot of things thatTDB does not need (e.g reading and writing the bytes of keys withoutcopies for safety). Transactions need to be coordinated across multipleB+Trees so a transactional B+Tree would not be much help unless it had away to integrate into a larger transaction system.

Normally, the B+trees are memory-mapped into the JVM which gives thecaching, using in effect, the OS disk cache.


Hope this description helps,

        Andy



Thank you for your time.


Best regards

Dennis Grinwald

Re: B+-Tree indexing

Reply via email to