Re: Estimating TDB2 size

ajs6f Sun, 26 Nov 2017 08:20:32 -0800

You have to start with the understanding that the indexes in a database are not
the same thing as a simple file of triples or quads, nor do they serve the same
purpose. TDB1 and TDB2 store the same triple several times in different orders
(SPO, POS, etc.) so that they can answer arbitrary queries with good
performance. This is a common technique.
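
As a rough illustration of that technique (a toy Python sketch, not how TDB is
actually implemented; the names ORDERS, build_indexes and match are made up
for this example), the same triples can be kept in several sorted orders so
that a pattern with bound terms becomes a prefix lookup in one of them:

```python
from bisect import bisect_left

# Each ordering maps index-key positions back to (s, p, o) positions.
# Every triple is stored once per ordering, which is where the space goes.
ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}


def build_indexes(triples):
    """Return one sorted list of permuted triples per ordering."""
    return {
        name: sorted(tuple(t[i] for i in perm) for t in triples)
        for name, perm in ORDERS.items()
    }


def match(indexes, s=None, p=None, o=None):
    """Find triples matching a pattern; None means 'any term'."""
    pattern = (s, p, o)

    def bound_prefix(perm):
        # Number of leading key positions that the pattern binds.
        n = 0
        for i in perm:
            if pattern[i] is None:
                break
            n += 1
        return n

    # Use the ordering whose key starts with the most bound terms.
    name, perm = max(ORDERS.items(), key=lambda kv: bound_prefix(kv[1]))
    prefix = tuple(pattern[i] for i in perm)[:bound_prefix(perm)]
    index = indexes[name]

    results = []
    # Binary search to the first key >= prefix, then scan while it matches.
    for key in index[bisect_left(index, prefix):]:
        if key[:len(prefix)] != prefix:
            break
        triple = [None, None, None]
        for pos, i in enumerate(perm):
            triple[i] = key[pos]
        # Re-check bound terms that the prefix did not cover.
        if all(w is None or w == g for w, g in zip(pattern, triple)):
            results.append(tuple(triple))
    return name, results
```

With only one ordering, a query that binds just the object would have to scan
everything. The extra copies are the price paid for avoiding that scan.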


There is no reason to expect a database that is capable of answering arbitrary
queries with good performance to be as small as a plain file, which cannot do
that.
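
For a back-of-envelope feel for the numbers, here is a hedged sketch in
Python. It is not a description of TDB's actual on-disk layout. It assumes
8 bytes per NodeId, as Andy describes further down the thread, and treats the
number of orderings and the block fill factor as free parameters. The function
name and the example values of 3 and 6 orderings are assumptions for
illustration only; check which index files your own database directory
actually contains.

```python
def index_bytes_per_tuple(ids_per_tuple, orderings, node_id_bytes=8, fill=1.0):
    """Index space per stored tuple, ignoring the node table entirely.

    fill is the assumed fraction of each index block that holds data;
    1.0 means perfectly packed blocks, which real B+trees rarely achieve.
    """
    if not 0 < fill <= 1:
        raise ValueError("fill must be in (0, 1]")
    return ids_per_tuple * node_id_bytes * orderings / fill


# A triple is 3 NodeIds; a quad (a triple in a named graph) is 4.
triple_cost = index_bytes_per_tuple(3, orderings=3)  # 3 * 8 * 3 bytes
quad_cost = index_bytes_per_tuple(4, orderings=6)    # 4 * 8 * 6 bytes
```

The point is only that the per-tuple cost scales with the number of orderings
and with how full the index blocks are, while an N-Triples file stores each
triple once and gzip then squeezes out most of the repeated text.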

ajs6f

> On Nov 26, 2017, at 11:17 AM, Andrew U. Frank <fr...@geoinfo.tuwien.ac.at> 
> wrote:
> 
> thank you for the explanations:
> 
> to laura: i guess HDT would reduce the size of my files considerably. where 
> could i find information on how to use fuseki with HDT? it might be worth
> trying to see how response time changes.
> 
> to andy: am i correct to understand that a triple (uri p literal) is
> translated into two triples, (uri p uriX) and (uriX s literal), for
> some properties p and s? is there any reuse of existing literals? that would 
> give for each literal triple approx. 60 bytes?
> 
> i still do not understand how a triple needs about 300 bytes of storage (or
> how an nt.gzip file of 219 M gives a TDB database of 13 GB).
> 
> size of the database is of concern to me and I think it influences 
> performance through the use of IO time.
> 
> thank you all very much for the clarifications!
> 
> andrew
> 
> 
> 
> On 11/26/2017 07:30 AM, Andy Seaborne wrote:
>> Every RDFTerm gets a NodeId in TDB.  A triple is 3 NodeIds.
>> 
>> There is a big cache, NodeId->RDFTerm.
>> 
>> In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int and 
>> long (96 bits) - the current implementation is using 64 bits.
>> 
>> It is very common as a design to dictionary (intern) terms because joins can
>> be done by comparing integers, not testing whether two strings are the
>> same, which is much more expensive.
>> 
>> In addition, TDBx inlines numbers (integers), date/times, and some others.
>> 
>> https://jena.apache.org/documentation/tdb/architecture.html
>> 
>> TDBx could, but doesn't, store compressed data on disk. There are pros and 
>> cons of this.
>> 
>>     Andy
>> 
>> On 26/11/17 08:30, Laura Morales wrote:
>>> Perhaps a bit tangential but this is somehow related to how HDT stores its 
>>> data (I've run some tests with Fuseki + HDT store instead of TDB). 
>>> Basically, they assign each subject, predicate, and object an integer 
>>> value. It keeps an index to map integers with the corresponding string (of 
>>> the original value), and then they store every triple using integers 
>>> instead of strings (something like "1 2 9. 8 2 1 ." and so forth). The
>>> drawback I think is that they have to translate indices/strings back and 
>>> forth at each query, nonetheless the response time is still impressive 
>>> (milliseconds), and it compresses the original file *a lot*. By a lot I 
>>> mean that for Wikidata (not the full file though, but one with about 2.3 
>>> billion triples) the HDT is more or less 40GB, and gz-compressed about 
>>> 10GB. The problem is that their rdf2hdt tool is so inefficient that it does 
>>> everything in RAM, so to convert something like wikidata you'd need at 
>>> least a machine with 512GB of ram (or swap if you have a fast enough swap 
>>> :D). Also the tool looks like it can't handle files with more than 2^32 
>>> triples, although HDT (the format) does handle them. So as long as you can 
>>> handle the conversion, if you want to save space you could benefit from 
>>> using a HDT store rather than using TDB.
>>> 
>>> 
>>> 
>>> Sent: Sunday, November 26, 2017 at 5:30 AM
>>> From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
>>> To: users@jena.apache.org
>>> Subject: Re: Estimating TDB2 size
>>> i have specific questions in relation to what ajs6f said:
>>> 
>>> i have a TDB store with 1/3 triples with very small literals (3-5 char),
>>> where the same sequence is often repeated. would i get smaller store and
>>> better performance if these were URI of the character sequence (stored
>>> once for each repeated case)? any guess how much I could improve?
>>> 
>>> does the size of the URI play a role in the amount of storage used? i
>>> observe that i have for 33 M triples a TDB size (files) of 13 GB, which
>>> means about 300 byte per triple. the literals are all short (very seldom
>>> more than 10 char, mostly 5 - words from english text). it is a named
>>> graph, if this makes a difference.
>>> 
>>> thank you!
>>> 
>>> andrew
>>> 
> 
> -- 
> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
>                                 +43 1 58801 12710 direct
> Geoinformation, TU Wien          +43 1 58801 12700 office
> Gusshausstr. 27-29               +43 1 55801 12799 fax
> 1040 Wien Austria                +43 676 419 25 72 mobil
>
