I wonder... if TDB, like HDT, uses integers instead of strings, why is there such a difference in store size? HDT files are so much smaller.
Sent: Sunday, November 26, 2017 at 1:30 PM
From: "Andy Seaborne" <a...@apache.org>
To: users@jena.apache.org
Subject: Re: Estimating TDB2 size

Every RDFTerm gets a NodeId in TDB. A triple is 3 NodeIds.

There is a big cache, NodeId->RDFTerm.

In TDB1 and TDB2, a NodeId is stored as 8 bytes. The TDB2 design is an int and a long (96 bits) - the current implementation is using 64 bits.

It is very common as a design to dictionary-encode (intern) terms, because joins can then be done by comparing integers, not by testing whether two strings are the same, which is much more expensive.

In addition, TDBx inlines numbers such as integers, date/times and some others.

https://jena.apache.org/documentation/tdb/architecture.html

TDBx could, but doesn't, store compressed data on disk. There are pros and cons of this.

Andy
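
For illustration, the dictionary (intern) design described in the quoted message can be sketched roughly as below. This is a toy in Python, not Jena code: `NodeTable`, `encode` and `uncompressed_triple_bytes` are hypothetical names, and the ids are small sequential integers rather than real TDB NodeIds. The only figure taken from the message is 8 bytes per NodeId; the number of indexes is left as a parameter.

```python
NODE_ID_BYTES = 8  # from the message: a NodeId is stored as 8 bytes


class NodeTable:
    """Toy dictionary: RDF term (string) <-> integer id. Hypothetical, not Jena's API."""

    def __init__(self):
        self._term_to_id = {}
        self._id_to_term = []

    def intern(self, term):
        """Return the id for a term, allocating a new one on first sight."""
        node_id = self._term_to_id.get(term)
        if node_id is None:
            node_id = len(self._id_to_term)
            self._term_to_id[term] = node_id
            self._id_to_term.append(term)
        return node_id

    def term(self, node_id):
        """Map an id back to its term (the NodeId->RDFTerm direction)."""
        return self._id_to_term[node_id]


def encode(table, triples):
    """Replace each term in each triple by its integer id."""
    return [tuple(table.intern(t) for t in triple) for triple in triples]


def uncompressed_triple_bytes(n_triples, n_indexes=1):
    """Size of fixed-width triple entries: 3 NodeIds each, no compression."""
    return n_triples * 3 * NODE_ID_BYTES * n_indexes


table = NodeTable()
triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:knows", "ex:carol"),
]
encoded = encode(table, triples)

# Join "object of one triple == subject of another" using integer comparison only.
joined = [(a, b) for a in encoded for b in encoded if a[2] == b[0]]
```

Because every entry is fixed-width and uncompressed, the triple data grows at 24 bytes per triple per index regardless of how repetitive it is. A format that compresses its id sequences can come out much smaller on the same data.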