I note that you specifically said that your data file was compressed with 
GZip. You can often get around 32 times compression for NTriples because 
its verbosity makes it extremely well suited to GZip compression. Therefore, 
your uncompressed data is probably more like 6-7 GB, so full-coverage 
indexing and dictionary encoding are roughly doubling the actual data size.
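
If you want to sanity-check the ratio on your own file, a quick sketch like 
the following works ("data.nt.gz" is just a placeholder for your file name):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.zip.GZIPInputStream;

    // Stream-decompress the file and count the bytes, then compare with
    // the on-disk compressed size to estimate the compression ratio.
    public class GzipRatio {
        public static void main(String[] args) throws IOException {
            Path gz = Path.of("data.nt.gz");   // placeholder file name
            long compressed = Files.size(gz);
            long uncompressed = 0;
            byte[] buf = new byte[1 << 16];
            try (InputStream in = new GZIPInputStream(Files.newInputStream(gz))) {
                int n;
                while ((n = in.read(buf)) > 0) {
                    uncompressed += n;
                }
            }
            System.out.printf("ratio = %.1fx%n",
                    (double) uncompressed / compressed);
        }
    }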

Rob

On 26/11/2017, 18:11, "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at> wrote:

    Of course, my comparison was naive - and I should have known better, 
    having worked with databases back in the 1980s. Still, it is surprising 
    when a 200 MB file becomes 13 GB. Given that disk space and memory are 
    inexpensive compared to human time waiting for responses, the design 
    choices are amply justified. I will buy more memory (;-)
    
    andrew
    
    
    On 11/26/2017 11:20 AM, ajs6f wrote:
    > You have to start with the understanding that the indexes in a
    > database are not the same thing, nor for the same purpose, as a simple
    > file of triples or quads. TDB1 and TDB2 store the same triple several
    > times in different orders (SPO, OPS, etc.) in order to be able to
    > answer arbitrary queries with good performance, which is a common
    > technique.
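    >
    > As a toy sketch of why several orders help (illustrative only, nothing
    > like TDB's actual code): keep the same triples sorted two ways, and
    > answer a pattern with a prefix scan on whichever order puts the bound
    > terms first.
    >
    >     import java.util.NavigableSet;
    >     import java.util.TreeSet;
    >
    >     // Toy illustration: the same triples kept in two sort orders.
    >     // A pattern with only P and O bound becomes a range scan on the
    >     // POS order; with only S bound, a range scan on SPO.
    >     public class MultiIndexSketch {
    >         static final NavigableSet<String> spo = new TreeSet<>();
    >         static final NavigableSet<String> pos = new TreeSet<>();
    >
    >         static void add(String s, String p, String o) {
    >             spo.add(s + " " + p + " " + o);
    >             pos.add(p + " " + o + " " + s);
    >         }
    >
    >         public static void main(String[] args) {
    >             add("ex:alice", "foaf:knows", "ex:bob");
    >             add("ex:alice", "foaf:name", "\"Alice\"");
    >             add("ex:carol", "foaf:knows", "ex:bob");
    >             // ?s foaf:knows ex:bob  ->  prefix scan on the POS order
    >             String prefix = "foaf:knows ex:bob ";
    >             for (String key : pos.tailSet(prefix)) {
    >                 if (!key.startsWith(prefix)) break;
    >                 System.out.println("?s = " + key.substring(prefix.length()));
    >             }
    >         }
    >     }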
    >
    > There is no reason to expect a database that is capable of answering
    > arbitrary queries with good performance to be as small as a file,
    > which is not.
    >
    > ajs6f
    >
    >> On Nov 26, 2017, at 11:17 AM, Andrew U. Frank
    >> <fr...@geoinfo.tuwien.ac.at> wrote:
    >>
    >> thank you for the explanations:
    >>
    >> To Laura: I guess HDT would reduce the size of my files considerably.
    >> Where could I find information on how to use Fuseki with HDT? It might
    >> be worth trying, to see how the response time changes.
    >>
    >> To Andy: am I correct to understand that a triple (uri p literal) is
    >> translated into two triples, (uri p uriX) and a second one
    >> (uriX s literal), for some properties p and s? Is there any reuse of
    >> existing literals? That would give approx. 60 bytes for each literal
    >> triple?
    >>
    >> I still do not understand why a triple needs about 400 bytes of
    >> storage (or how an nt.gz file of 219 MB gives a TDB database of
    >> 13 GB).
    >>
    >> The size of the database is of concern to me, and I think it
    >> influences performance through increased I/O time.
    >>
    >> thank you all very much for the clarifications!
    >>
    >> andrew
    >>
    >>
    >>
    >> On 11/26/2017 07:30 AM, Andy Seaborne wrote:
    >>> Every RDFTerm gets a NodeId in TDB.  A triple is 3 NodeIds.
    >>>
    >>> There is a big cache, NodeId->RDFTerm.
    >>>
    >>> In TDB1 and TDB2, a NodeId is stored as 8 bytes. The TDB2 design is
    >>> an int and a long (96 bits); the current implementation uses 64 bits.
    >>>
    >>> It is a very common design to dictionary-encode (intern) terms,
    >>> because joins can then be done by comparing integers rather than
    >>> testing whether two strings are the same, which is much more
    >>> expensive.
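    >>>
    >>> A minimal sketch of the idea (illustrative names only, not TDB's
    >>> real classes): each distinct term string is assigned a long once,
    >>> triples are stored as three longs, and a join is a long comparison.
    >>>
    >>>     import java.util.ArrayList;
    >>>     import java.util.HashMap;
    >>>     import java.util.List;
    >>>     import java.util.Map;
    >>>
    >>>     // Sketch of dictionary (interning) encoding -- illustrative
    >>>     // names, not TDB's real classes. Each distinct term string is
    >>>     // assigned a long id once; triples become three longs.
    >>>     public class DictionarySketch {
    >>>         private final Map<String, Long> toId = new HashMap<>();
    >>>         private final List<String> toTerm = new ArrayList<>();
    >>>
    >>>         long intern(String term) {
    >>>             return toId.computeIfAbsent(term, t -> {
    >>>                 toTerm.add(t);
    >>>                 return (long) (toTerm.size() - 1);
    >>>             });
    >>>         }
    >>>
    >>>         String term(long id) {
    >>>             return toTerm.get((int) id);
    >>>         }
    >>>
    >>>         public static void main(String[] args) {
    >>>             DictionarySketch d = new DictionarySketch();
    >>>             long[] t1 = { d.intern("ex:alice"), d.intern("foaf:knows"),
    >>>                           d.intern("ex:bob") };
    >>>             long[] t2 = { d.intern("ex:bob"), d.intern("foaf:knows"),
    >>>                           d.intern("ex:carol") };
    >>>             // Joining t1's object with t2's subject is one long
    >>>             // comparison -- no string equality test needed.
    >>>             System.out.println(t1[2] == t2[0]); // true
    >>>         }
    >>>     }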
    >>>
    >>> In addition, TDBx inlines numbers (integers), date/times, and some
    >>> other value types directly into the NodeId.
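    >>>
    >>> A sketch of what inlining means (the bit layout below is invented
    >>> for illustration; it is not TDB's actual NodeId encoding): a small
    >>> integer is packed into the 64-bit id itself, so reading its value
    >>> needs no node-table lookup at all.
    >>>
    >>>     // Invented layout: the top bit tags an inlined integer, the low
    >>>     // 63 bits hold the value. NOT TDB's real encoding.
    >>>     public class InlineSketch {
    >>>         static final long INLINE_INT = 1L << 63;
    >>>
    >>>         static long inlineInt(long value) {   // assumes value >= 0
    >>>             return INLINE_INT | value;
    >>>         }
    >>>
    >>>         static boolean isInlinedInt(long id) {
    >>>             return (id & INLINE_INT) != 0;
    >>>         }
    >>>
    >>>         static long intValue(long id) {
    >>>             return id & ~INLINE_INT;
    >>>         }
    >>>
    >>>         public static void main(String[] args) {
    >>>             long id = inlineInt(42);
    >>>             if (isInlinedInt(id)) {
    >>>                 System.out.println(intValue(id)); // 42, no lookup
    >>>             }
    >>>         }
    >>>     }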
    >>>
    >>> https://jena.apache.org/documentation/tdb/architecture.html
    >>>
    >>> TDBx could, but doesn't, store compressed data on disk. There are
    >>> pros and cons of this.
    >>>
    >>>      Andy
    >>>
    >>> On 26/11/17 08:30, Laura Morales wrote:
    >>>> Perhaps a bit tangential, but this is somehow related to how HDT
    >>>> stores its data (I've run some tests with Fuseki + an HDT store
    >>>> instead of TDB). Basically, HDT assigns each subject, predicate,
    >>>> and object an integer value. It keeps an index to map each integer
    >>>> to the corresponding string (the original value), and then it
    >>>> stores every triple using integers instead of strings (something
    >>>> like "1 2 9 . 8 2 1 ." and so forth). The drawback, I think, is
    >>>> that it has to translate between integers and strings at each
    >>>> query; nonetheless the response time is still impressive
    >>>> (milliseconds), and it compresses the original file *a lot*. By a
    >>>> lot I mean that for Wikidata (not the full file, but one with
    >>>> about 2.3 billion triples) the HDT is more or less 40 GB, and
    >>>> gz-compressed about 10 GB. The problem is that their rdf2hdt tool
    >>>> is so inefficient that it does everything in RAM, so to convert
    >>>> something like Wikidata you'd need a machine with at least 512 GB
    >>>> of RAM (or swap, if you have fast enough swap :D). Also, the tool
    >>>> looks like it can't handle files with more than 2^32 triples,
    >>>> although HDT (the format) does handle them. So as long as you can
    >>>> handle the conversion, if you want to save space you could benefit
    >>>> from using an HDT store rather than TDB.
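    >>>>
    >>>> To make the "1 2 9 ." idea concrete, here is a toy decode step
    >>>> (illustrative only -- HDT's real format is a compact binary
    >>>> structure, not a text file like this):
    >>>>
    >>>>     import java.util.List;
    >>>>
    >>>>     // Toy decoding of integer-coded triples back to strings. The
    >>>>     // position in the list plays the role of the id (1-based).
    >>>>     public class DecodeSketch {
    >>>>         public static void main(String[] args) {
    >>>>             List<String> dict = List.of(
    >>>>                 "ex:alice", "foaf:knows", "ex:bob",   // ids 1..3
    >>>>                 "ex:carol", "foaf:name", "\"Ann\"",   // ids 4..6
    >>>>                 "ex:dave", "ex:eve", "ex:frank");     // ids 7..9
    >>>>             String encoded = "1 2 9 . 8 2 1 .";
    >>>>             for (String triple : encoded.split("\\s*\\.\\s*")) {
    >>>>                 if (triple.isBlank()) continue;
    >>>>                 StringBuilder out = new StringBuilder();
    >>>>                 for (String id : triple.trim().split("\\s+")) {
    >>>>                     out.append(dict.get(Integer.parseInt(id) - 1))
    >>>>                        .append(' ');
    >>>>                 }
    >>>>                 System.out.println(out.append('.'));
    >>>>             }
    >>>>         }
    >>>>     }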
    >>>>
    >>>>
    >>>>
    >>>> Sent: Sunday, November 26, 2017 at 5:30 AM
    >>>> From: "Andrew U. Frank" <fr...@geoinfo.tuwien.ac.at>
    >>>> To: users@jena.apache.org
    >>>> Subject: Re: Estimating TDB2 size
    >>>> I have specific questions in relation to what ajs6f said:
    >>>>
    >>>> I have a TDB store where 1/3 of the triples have very small
    >>>> literals (3-5 chars), and the same sequence is often repeated.
    >>>> Would I get a smaller store and better performance if these were
    >>>> URIs of the character sequences (stored once for each repeated
    >>>> case)? Any guess how much I could improve?
    >>>>
    >>>> Does the size of the URI play a role in the amount of storage
    >>>> used? I observe that for 33 M triples I have a TDB size (files) of
    >>>> 13 GB, which means about 400 bytes per triple. The literals are
    >>>> all short (very seldom more than 10 chars, mostly 5 - words from
    >>>> English text). It is a named graph, if this makes a difference.
    >>>>
    >>>> thank you!
    >>>>
    >>>> andrew
    >>>>
    
    -- 
    em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                      +43 1 58801 12710 direct
    Geoinformation, TU Wien          +43 1 58801 12700 office
    Gusshausstr. 27-29               +43 1 55801 12799 fax
    1040 Wien Austria                +43 676 419 25 72 mobil
    
    
    



