On 24/05/17 02:33, Dennis Grinwald wrote:
> Hi there,

Hi Dennis,

> I'm very new to the Jena TDB and have several questions about it's
> Architecture. I would be very thankful if you could answer the following :

All Nodes get given a NodeId which is a fixed length 8 byte sequence.

Some Nodes (RDF Terms) like integers, dateTimes, hold the value in the NodeId itself (8 bits types, 56 bits of value) otherwise the NodeId records the location in the node table for the details of the node are recorded (URIs, strings, out of range integers...).

== Lookup NodeId -> Node

The NodeId is the location in the node table so TDB just fetechs it from the file.

There is a large cache as well. This lookup happens on query results so it is important.

== Lookup Node -> NodeId

For this, the Node is hashed (MD5 : 128 bits) and the hash assumed to be unique. 128 bits is a very large number!

There is then a B+Tree with key node hash and value NodeId.

This lookup happens when building the query. It's cached and there is a a "miss" cache as well for nodes not in the database.

== SPO etc

The key is 3 NodeIds - 24 bytes.  There is no value part.

Because the B+Tree has the keys sorted, you can do lookup of S?? to find the start of all triples with that S (lookup S00), then a range scan forwards until (S+1)00 through the B+Tree leaves. The leaves are chained horizontally. (This is for TDB1 ; not true in TDB2 - the leaves are not linked.)

Similarly, you can lookup SP?, or SPO.

The B+Trees are about order 200 - they use a 8Kbyte block.


1. In "Clustered TDB: A Clustered Triple Store for Jena" published by Mr. Seaborn, Owens, Gibbins and Schraefel, it says that the Indexes: SPO, POS, OSP are B+-Trees. My question is: How are the internal nodes (keys) of this B+-Tree represented and what are the values/value pointers in the leaves of the B+-Tree?

That was using a much older version of TDB but the design is the roughly the same.

2. On "http://jena.apache.org/documentation/tdb/architecture.html"; it says, that the Node to NodeID mapping is realized by a B+-Tree. My question as in "1." : How are the internal nodes (keys) of this B+-Tree represented and what are the values/value pointers in the leaves of the B+-Tree?

That lookup does not need the sorted property of B+Trees, just a a key->value lookup such as a external hash index. But the cache is so important anyway, it was simpler to have one datastructure.

It is quite important in TDB that it uses it's own implementation of B+Trees because they are efficient (in space and time) for the TDB case. A general purpose B+Tree storage package is doing a lot of things that TDB does not need (e.g reading and writing the bytes of keys without copies for safety). Transactions need to be coordinated across multiple B+Trees so a transactional B+Tree would not be much help unless it had a way to integrate into a larger transaction system.

Normally, the B+trees are memory-mapped into the JVM which gives the caching, using in effect, the OS disk cache.

Hope this description helps,

        Andy



Thank you for your time.


Best regards

Dennis Grinwald


Reply via email to