Every RDFTerm gets a NodeId in TDB. A triple is 3 NodeIds.
There is a big cache, NodeId->RDFTerm.
In TDB1 and TDB2, a NodeId is stored as 8 bytes. The TDB2 design allows for
an int and a long (96 bits); the current implementation uses 64 bits.
It is a very common design to dictionary-encode (intern) terms, because
joins can then be done by comparing integers rather than testing whether
two strings are the same, which is much more expensive.
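A minimal sketch of that dictionary idea (class and method names here are illustrative, not TDB's actual internals): each distinct term string is assigned an integer id once, and a join test becomes a single integer comparison.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of dictionary (interning) encoding: each distinct term string gets
// one integer id; equality of terms is then one long comparison.
public class TermDictionary {
    private final Map<String, Long> termToId = new HashMap<>();
    private final List<String> idToTerm = new ArrayList<>();

    // Return the existing id for a term, or assign the next free one.
    public long intern(String term) {
        Long id = termToId.get(term);
        if (id == null) {
            id = (long) idToTerm.size();
            idToTerm.add(term);
            termToId.put(term, id);
        }
        return id;
    }

    // Reverse lookup: id back to the original term string.
    public String lookup(long id) {
        return idToTerm.get((int) id);
    }

    public static void main(String[] args) {
        TermDictionary dict = new TermDictionary();
        long a = dict.intern("http://example.org/alice");
        long b = dict.intern("http://example.org/alice"); // same term, same id
        long c = dict.intern("http://example.org/bob");
        System.out.println(a == b); // true - the join test is one comparison
        System.out.println(a == c); // false
        System.out.println(dict.lookup(c)); // http://example.org/bob
    }
}
```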
In addition, TDBx inlines numbers (integers, decimals), date/times and some
other datatypes directly into the NodeId.
https://jena.apache.org/documentation/tdb/architecture.html
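The general idea of inlining can be sketched as follows - a tag bit marks the NodeId as carrying the value itself, so decoding needs no node-table lookup. This is only the principle; TDB's actual bit layout and the set of inlined types differ.

```java
// Sketch of value inlining: a small integer is packed directly into the
// 64-bit NodeId, with the high bit set as a type tag. Decoding recovers the
// value without consulting the node table. Illustrative only, not TDB's
// real encoding.
public class InlineNodeId {
    private static final long INLINE_FLAG = 1L << 63;   // tag: payload is inlined
    private static final long PAYLOAD_MASK = ~INLINE_FLAG;

    // Inline a non-negative integer that fits in the 63 payload bits.
    public static long inlineInt(long value) {
        if (value < 0)
            throw new IllegalArgumentException("value does not fit inline");
        return INLINE_FLAG | value;
    }

    public static boolean isInline(long nodeId) {
        return (nodeId & INLINE_FLAG) != 0;
    }

    // Decode the value straight from the id.
    public static long decode(long nodeId) {
        return nodeId & PAYLOAD_MASK;
    }

    public static void main(String[] args) {
        long id = inlineInt(42);
        System.out.println(isInline(id)); // true
        System.out.println(decode(id));   // 42
    }
}
```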
TDBx could, but doesn't, store compressed data on disk. There are pros
and cons of this.
Andy
On 26/11/17 08:30, Laura Morales wrote:
Perhaps a bit tangential but this is somehow related to how HDT stores its data (I've run
some tests with Fuseki + HDT store instead of TDB). Basically, they assign each subject,
predicate, and object an integer value. They keep an index to map each integer to the
corresponding string (the original value), and then store every triple using
integers instead of strings (something like "1 2 9 . 8 2 1 ." and so forth). The
drawback I think is that they have to translate indices/strings back and forth at each
query, nonetheless the response time is still impressive (milliseconds), and it
compresses the original file *a lot*. By a lot I mean that for Wikidata (not the full
file though, but one with about 2.3 billion triples) the HDT is more or less 40GB, and
gz-compressed about 10GB. The problem is that their rdf2hdt tool is so inefficient that
it does everything in RAM, so to convert something like Wikidata you'd need at least a
machine with 512GB of RAM (or swap, if you have fast enough swap :D). Also the tool
looks like it can't handle files with more than 2^32 triples, although HDT (the format)
does handle them. So, as long as you can handle the conversion, if you want to save space
you could benefit from using an HDT store rather than TDB.
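The round-trip described above (strings in, integers stored, strings out at query time) can be sketched like this - names are illustrative, not HDT's actual API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of HDT-style triple storage: terms pass through a dictionary once,
// triples are stored as three integers, and results are translated back to
// strings on the way out of a query.
public class IntTripleStore {
    private final Map<String, Integer> ids = new HashMap<>();
    private final List<String> terms = new ArrayList<>();
    private final List<int[]> triples = new ArrayList<>();

    private int id(String term) {
        return ids.computeIfAbsent(term, t -> {
            terms.add(t);
            return terms.size() - 1;
        });
    }

    public void add(String s, String p, String o) {
        triples.add(new int[] { id(s), id(p), id(o) });
    }

    // Find objects for (s, p); matching uses integer comparisons only,
    // and ids are translated back to strings at the end.
    public List<String> objects(String s, String p) {
        Integer si = ids.get(s), pi = ids.get(p);
        List<String> out = new ArrayList<>();
        if (si == null || pi == null) return out;
        for (int[] t : triples)
            if (t[0] == si && t[1] == pi)
                out.add(terms.get(t[2]));
        return out;
    }
}
```

The per-query id-to-string translation is the "back and forth" cost mentioned above; the win is that the triple table itself holds only fixed-width integers.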
Sent: Sunday, November 26, 2017 at 5:30 AM
From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
To: users@jena.apache.org
Subject: Re: Estimating TDB2 size
I have specific questions in relation to what ajs6f said:
I have a TDB store where 1/3 of the triples have very small literals (3-5
characters), and the same sequences are often repeated. Would I get a
smaller store and better performance if these were URIs of the character
sequences (stored once for each repeated case)? Any guess how much I could improve?
Does the size of the URI play a role in the amount of storage used? I
observe that for 33 M triples I have a TDB size (files) of 13 GB, which
means about 400 bytes per triple. The literals are all short (very seldom
more than 10 characters, mostly 5 - words from English text). It is a named
graph, if that makes a difference.
Thank you!
Andrew