On 24/05/17 02:33, Dennis Grinwald wrote:
> Hi there,
Hi Dennis,
> I'm very new to the Jena TDB and have several questions about it's
> Architecture. I would be very thankful if you could answer the
following :
All Nodes get given a NodeId which is a fixed length 8 byte sequence.
Some Nodes (RDF Terms) like integers, dateTimes, hold the value in the
NodeId itself (8 bits types, 56 bits of value) otherwise the NodeId
records the location in the node table for the details of the node are
recorded (URIs, strings, out of range integers...).
== Lookup NodeId -> Node
The NodeId is the location in the node table so TDB just fetechs it from
the file.
There is a large cache as well. This lookup happens on query results so
it is important.
== Lookup Node -> NodeId
For this, the Node is hashed (MD5 : 128 bits) and the hash assumed to be
unique. 128 bits is a very large number!
There is then a B+Tree with key node hash and value NodeId.
This lookup happens when building the query. It's cached and there is a
a "miss" cache as well for nodes not in the database.
== SPO etc
The key is 3 NodeIds - 24 bytes. There is no value part.
Because the B+Tree has the keys sorted, you can do lookup of S?? to find
the start of all triples with that S (lookup S00), then a range scan
forwards until (S+1)00 through the B+Tree leaves. The leaves are
chained horizontally. (This is for TDB1 ; not true in TDB2 - the leaves
are not linked.)
Similarly, you can lookup SP?, or SPO.
The B+Trees are about order 200 - they use a 8Kbyte block.
1. In "Clustered TDB: A Clustered Triple Store for Jena" published by
Mr. Seaborn, Owens, Gibbins and Schraefel, it says that the Indexes:
SPO, POS, OSP are B+-Trees. My question is: How are the internal nodes
(keys) of this B+-Tree represented and what are the values/value
pointers in the leaves of the B+-Tree?
That was using a much older version of TDB but the design is the roughly
the same.
2. On "http://jena.apache.org/documentation/tdb/architecture.html" it
says, that the Node to NodeID mapping is realized by a B+-Tree. My
question as in "1." : How are the internal nodes (keys) of this B+-Tree
represented and what are the values/value pointers in the leaves of the
B+-Tree?
That lookup does not need the sorted property of B+Trees, just a a
key->value lookup such as a external hash index. But the cache is so
important anyway, it was simpler to have one datastructure.
It is quite important in TDB that it uses it's own implementation of
B+Trees because they are efficient (in space and time) for the TDB case.
A general purpose B+Tree storage package is doing a lot of things that
TDB does not need (e.g reading and writing the bytes of keys without
copies for safety). Transactions need to be coordinated across multiple
B+Trees so a transactional B+Tree would not be much help unless it had a
way to integrate into a larger transaction system.
Normally, the B+trees are memory-mapped into the JVM which gives the
caching, using in effect, the OS disk cache.
Hope this description helps,
Andy
Thank you for your time.
Best regards
Dennis Grinwald