You have to start with the understanding that the indexes in a database are not the same thing as a simple file of triples or quads, nor do they serve the same purpose. TDB1 and TDB2 store the same triple several times in different orders (SPO, OPS, etc.) so that they can answer arbitrary queries with good performance. This is a common technique.
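To make the trade-off concrete, here is a minimal sketch of the general idea, not TDB's actual implementation (TDB uses on-disk B+trees and a separate node table). The class name `TinyTripleStore`, its methods, and the choice of the three orderings SPO/POS/OSP are assumptions made for illustration only. Terms are interned to integer ids, and every triple is written once per ordering so that any pattern can be answered by a prefix scan of some index.

```python
from bisect import bisect_left


class TinyTripleStore:
    """Toy in-memory store: interned terms plus one sorted index per ordering.

    Illustration only; names and orderings are hypothetical, not TDB's API.
    """

    # Each ordering maps index columns to positions in an (s, p, o) triple.
    ORDERS = {"SPO": (0, 1, 2), "POS": (1, 2, 0), "OSP": (2, 0, 1)}

    def __init__(self):
        self.term_to_id = {}   # term string -> integer id
        self.id_to_term = []   # integer id -> term string
        self.indexes = {name: [] for name in self.ORDERS}

    def _intern(self, term):
        # Store each distinct term once and hand back its integer id.
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def add(self, s, p, o):
        ids = tuple(self._intern(t) for t in (s, p, o))
        # The same triple is written into every index, reordered each time.
        for name, order in self.ORDERS.items():
            key = tuple(ids[i] for i in order)
            index = self.indexes[name]
            pos = bisect_left(index, key)
            if pos == len(index) or index[pos] != key:
                index.insert(pos, key)

    def find(self, s=None, p=None, o=None):
        ids = []
        for term in (s, p, o):
            if term is None:
                ids.append(None)
            elif term in self.term_to_id:
                ids.append(self.term_to_id[term])
            else:
                return []  # unknown term: nothing can match
        # Pick the ordering with the longest run of bound leading columns.
        best_name, best_len = "SPO", 0
        for name, order in self.ORDERS.items():
            bound = 0
            for i in order:
                if ids[i] is None:
                    break
                bound += 1
            if bound > best_len:
                best_name, best_len = name, bound
        order = self.ORDERS[best_name]
        prefix = tuple(ids[i] for i in order[:best_len])
        index = self.indexes[best_name]
        results = []
        # Prefix scan: jump to the first candidate, stop when the prefix changes.
        for key in index[bisect_left(index, prefix):]:
            if key[:best_len] != prefix:
                break
            spo = [None, None, None]
            for col, i in enumerate(order):
                spo[i] = key[col]
            # Residual check for any bound position not covered by the prefix.
            if all(ids[i] is None or ids[i] == spo[i] for i in range(3)):
                results.append(tuple(self.id_to_term[x] for x in spo))
        return results
```

In this sketch every triple occupies three index entries of three ids each, on top of the term dictionary. That is the price of being able to answer a pattern such as `find(p="p", o="x")` without scanning everything, and it is why such a store is larger than a flat file of the same triples.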
There is no reason to expect a database that can answer arbitrary queries with good performance to be as small as a file, which cannot.

ajs6f

> On Nov 26, 2017, at 11:17 AM, Andrew U. Frank <fr...@geoinfo.tuwien.ac.at> wrote:
>
> thank you for the explanations:
>
> to laura: i guess HDT would reduce the size of my files considerably. where
> could i find information how to use fuseki with HDT? it might be worth trying
> to see how response time changes.
>
> to andy: am i correct to understand that a triple (uri p literal) is
> translated into two triples (uri p uriX) and a second one (uriX s literal) for
> some properties p and s? is there any reuse of existing literals? that would
> give for each literal triple approx. 60 bytes?
>
> i still do not understand how a triple needs about 300 bytes of storage? (or
> how an nt.gzip file of 219 M gives a TDB database of 13 GB)
>
> size of the database is of concern to me and I think it influences
> performance through the use of IO time.
>
> thank you all very much for the clarifications!
>
> andrew
>
> On 11/26/2017 07:30 AM, Andy Seaborne wrote:
>> Every RDFTerm gets a NodeId in TDB. A triple is 3 NodeIds.
>>
>> There is a big cache, NodeId->RDFTerm.
>>
>> In TDB1 and TDB2, a NodeId is stored as 8 bytes. TDB2 design is an int and
>> long (96 bits) - the current implementation is using 64 bits.
>>
>> It is very common as a design to dictionary (intern) terms because joins can
>> be done by comparing integers, not testing whether two strings are the
>> same, which is much more expensive.
>>
>> In addition TDBx inlines numbers (integers), date/times and some others.
>>
>> https://jena.apache.org/documentation/tdb/architecture.html
>>
>> TDBx could, but doesn't, store compressed data on disk. There are pros and
>> cons of this.
>>
>> Andy
>>
>> On 26/11/17 08:30, Laura Morales wrote:
>>> Perhaps a bit tangential but this is somehow related to how HDT stores its
>>> data (I've run some tests with Fuseki + HDT store instead of TDB).
>>> Basically, they assign each subject, predicate, and object an integer
>>> value. It keeps an index to map integers with the corresponding string (of
>>> the original value), and then they store every triple using integers
>>> instead of strings (something like "1 2 9. 8 2 1 ." and so forth). The
>>> drawback I think is that they have to translate indices/strings back and
>>> forth at each query, nonetheless the response time is still impressive
>>> (milliseconds), and it compresses the original file *a lot*. By a lot I
>>> mean that for Wikidata (not the full file though, but one with about 2.3
>>> billion triples) the HDT is more or less 40GB, and gz-compressed about
>>> 10GB. The problem is that their rdf2hdt tool is so inefficient that it does
>>> everything in RAM, so to convert something like wikidata you'd need at
>>> least a machine with 512GB of ram (or swap if you have a fast enough swap
>>> :D). Also the tool looks like it can't handle files with more than 2^32
>>> triples, although HDT (the format) does handle them. So as long as you can
>>> handle the conversion, if you want to save space you could benefit from
>>> using a HDT store rather than using TDB.
>>>
>>>
>>>
>>> Sent: Sunday, November 26, 2017 at 5:30 AM
>>> From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
>>> To: users@jena.apache.org
>>> Subject: Re: Estimating TDB2 size
>>> i have specific questions in relation to what ajs6f said:
>>>
>>> i have a TDB store with 1/3 triples with very small literals (3-5 char),
>>> where the same sequence is often repeated. would i get smaller store and
>>> better performance if these were URI of the character sequence (stored
>>> once for each repeated case)? any guess how much I could improve?
>>>
>>> does the size of the URI play a role in the amount of storage used? i
>>> observe that i have for 33 M triples a TDB size (files) of 13 GB, which
>>> means about 300 byte per triple. the literals are all short (very seldom
>>> more than 10 char, mostly 5 - words from english text). it is a named
>>> graph, if this makes a difference.
>>>
>>> thank you!
>>>
>>> andrew
>>>
>
> --
> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
>                                  +43 1 58801 12710 direct
> Geoinformation, TU Wien          +43 1 58801 12700 office
> Gusshausstr. 27-29               +43 1 55801 12799 fax
> 1040 Wien Austria                +43 676 419 25 72 mobil
>