Re: Estimating TDB2 size

ajs6f Sun, 26 Nov 2017 08:59:12 -0800

The point is that HDT is expressly designed to be compact at all costs, and TDB 
is not at all. The indexes are one (important) aspect. From 
http://www.rdfhdt.org/what-is-hdt/:


"The internal compression techniques of HDT allow that most part of the data 
(or even the whole dataset) can be kept in main memory, which is several orders 
of magnitude faster than disks."

"HDT is read-only, so it can dispatch many queries per second using multiple 
threads."

That is a radically different design than Jena TDB, which relies on OS-provided 
file caching and offers transactional updates. Benchmarking is hard at best, 
and comparing software with different priorities and intentions doesn't help.

Otherwise, you could compare HDT with Jena's in-memory datasets, which 
(obviously) do expect that the data is kept in memory.

ajs6f

> On Nov 26, 2017, at 11:44 AM, Laura Morales <laure...@mail.com> wrote:
> 
> HDT does actually create more indices "out of band", that is, it creates a 
> separate file.hdt.index file. The combined size however is still much 
> smaller than a TDB store of the same file, but I don't know if this is down 
> to TDB simply having more indices.
>  
>  
> 
> Sent: Sunday, November 26, 2017 at 5:20 PM
> From: ajs6f <aj...@apache.org>
> To: users@jena.apache.org
> Subject: Re: Estimating TDB2 size
> You have to start with the understanding that the indexes in a database are 
> not the same thing nor for the same purpose as a simple file of triples or 
> quads. TDB1 and TDB2 store the same triple several times in different orders 
> (SPO, POS, etc.) in order to be able to answer arbitrary queries with good 
> performance, which is a common technique.
> 
> There is no reason to expect a database that is capable of answering 
> arbitrary queries with good performance to be as small as a plain file, 
> which offers no such capability.
> 
> ajs6f
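
To make the size trade-off in the quoted message concrete, here is a toy
sketch of the "same triple, several orderings" technique. It is only an
illustration: the class name TinyTripleStore, its methods, and the choice of
three orderings (SPO, POS, OSP) are assumptions made for this example, and
nothing here reflects how TDB actually lays out its indexes on disk.

```python
from collections import defaultdict


class TinyTripleStore:
    """Toy in-memory store (hypothetical) keeping each triple in three orderings."""

    def __init__(self):
        # Each index maps first key -> second key -> set of third keys.
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        # Every triple is written three times, once per ordering.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def entries(self):
        """Total number of stored index entries across all three orderings."""
        def count(index):
            return sum(len(thirds)
                       for seconds in index.values()
                       for thirds in seconds.values())
        return count(self.spo) + count(self.pos) + count(self.osp)

    def match(self, s=None, p=None, o=None):
        """Return sorted (s, p, o) tuples matching the pattern; None is a wildcard."""
        if s is not None and p is not None:
            objs = self.spo.get(s, {}).get(p, set())
            return sorted((s, p, x) for x in objs if o is None or x == o)
        if s is not None and o is not None:
            return sorted((s, x, o) for x in self.osp.get(o, {}).get(s, set()))
        if s is not None:
            return sorted((s, pp, oo)
                          for pp, objs in self.spo.get(s, {}).items()
                          for oo in objs)
        if p is not None and o is not None:
            return sorted((x, p, o) for x in self.pos.get(p, {}).get(o, set()))
        if p is not None:
            return sorted((ss, p, oo)
                          for oo, subs in self.pos.get(p, {}).items()
                          for ss in subs)
        if o is not None:
            return sorted((ss, pp, o)
                          for ss, preds in self.osp.get(o, {}).items()
                          for pp in preds)
        # No term bound: fall back to a full scan of one index.
        return sorted((ss, pp, oo)
                      for ss, ps in self.spo.items()
                      for pp, objs in ps.items()
                      for oo in objs)


store = TinyTripleStore()
store.add("alice", "knows", "bob")
store.add("alice", "knows", "carol")
store.add("bob", "knows", "carol")

print(store.match(p="knows", o="carol"))  # answered from the POS ordering
print(store.entries())                    # 3 triples, 9 index entries
```

Whichever terms of a pattern are bound, one of the orderings can answer it
with direct lookups rather than a scan, and the price is that every triple is
stored three times. A plain file of triples stores each one once and supports
none of these lookups.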
