Hi Vinay,
On 27/01/2022 06:14, Vinay Mahamuni wrote:
Hello,
I am using Apache Jena v4.3.2 + Fuseki + TDB2 persistent disk storage. I
am using Jena RDFConnection to connect to the Fuseki server. I am
sending 50k triples in one update. This is mostly new data (only a few
triples will match existing data). These data are instances based
on an ontology. Please have a look at the attached file showing how
much disk space increases with each update. For 1.5 million triples, it
took around 1.2GB. We want to store around a few billion triples,
so the bytes/triple ratio won't be good for our use case.
When I used the tdb2.tdbcompact tool, the data volume shrank to 400MB.
But this extra step needs to be performed manually to optimise the storage.
It can be triggered by an admin process with e.g. "cron".
It doesn't have to be done very often unless your volume of 50k triple
transactions is very high - in which case I suggest batching them into
larger units.
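For example (an untested sketch, assuming the full Fuseki server with the
administration API enabled, a dataset published at /ds and the default
port 3030), a small Java program like this could be run from cron to
request a compaction over HTTP:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CompactJob {
    public static void main(String[] args) throws Exception {
        // POST to the Fuseki administration endpoint for TDB2 compaction.
        // "/ds" and port 3030 are examples - adjust for your deployment.
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:3030/$/compact/ds"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> resp = http.send(req, HttpResponse.BodyHandlers.ofString());
        // The compaction runs as a server-side task; progress shows up in the server log.
        System.out.println("Compact request status: " + resp.statusCode());
    }
}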
My questions are as follows:
1. Why do 30 update queries of 50k triples each take 3 times more disk
space than a single update query of 1500k triples? The data stored is
the same, but the space consumed is more in the first case.
TDB2 uses an MVCC/copy-on-write scheme for transaction isolation. It
gives a very strong isolation guarantee (serializable).
That means there is a per-transaction overhead here, which is recovered
by compact. In fact, the space can't be recovered at the time because the
old data may still be in use by read-transactions seeing the pre-write state.
Compact is similar (not identical) to PostgreSQL VACUUM.
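A minimal embedded sketch of that behaviour (the directory name and the
example triple are placeholders; Fuseki uses the same machinery
internally): a reader that started before a write keeps seeing the
pre-write state, which is why the old blocks can't be reclaimed immediately.

import org.apache.jena.query.Dataset;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.system.Txn;
import org.apache.jena.tdb2.TDB2Factory;

public class MvccDemo {
    public static void main(String[] args) throws InterruptedException {
        Dataset ds = TDB2Factory.connectDataset("DB2");    // example directory

        // A long-running reader: it keeps seeing the state as of its start.
        Thread reader = new Thread(() -> Txn.executeRead(ds, () -> {
            long before = ds.getDefaultModel().size();
            sleep(2000);                                   // hold the read transaction open
            long after = ds.getDefaultModel().size();
            System.out.println("Reader sees " + before + " then " + after + " (unchanged)");
        }));
        reader.start();

        sleep(500);
        // A concurrent write commits new data, but the pre-write blocks must
        // stay on disk until no read transaction needs them any more.
        Txn.executeWrite(ds, () -> {
            Model m = ds.getDefaultModel();
            m.add(m.createResource("http://example/s"),
                  m.createProperty("http://example/p"),
                  "o");
        });
        reader.join();
    }

    private static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}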
Note that all additional space is recovered by "compact". The active
directory is the highest number "Data-NNNN". You can delete the earlier
ones once the "compact" has finished as logged in the server log. Or zip
them and keep them as backups - Fuseki has released them and does not
touch them. Caution: on MS Windows, due to a long-standing (10+ year)
Java JDK issue, the server has to be stopped and restarted to properly
release the old files.
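If you want to automate the clean-up, something along these lines would
identify the old storage areas (a rough sketch: the database path is an
example, and it only reports candidates rather than deleting anything -
deleting or zipping them is left to your admin process):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class OldDataAreas {
    public static void main(String[] args) throws IOException {
        Path dbDir = Paths.get("run/databases/DB");        // example database location
        try (Stream<Path> entries = Files.list(dbDir)) {
            List<Path> dataDirs = entries
                .filter(p -> p.getFileName().toString().matches("Data-\\d+"))
                .sorted(Comparator.comparing((Path p) -> p.getFileName().toString()))
                .collect(Collectors.toList());
            if (dataDirs.isEmpty())
                return;
            // Highest-numbered Data-NNNN is the active storage; the rest are old.
            Path active = dataDirs.get(dataDirs.size() - 1);
            System.out.println("Active storage: " + active);
            dataDirs.subList(0, dataDirs.size() - 1)
                    .forEach(p -> System.out.println("Old, removable: " + p));
        }
    }
}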
It doesn't matter whether it was one large write transaction or 100
write transactions: the compacted database will be the same size. The
database will have grown more for 100 writes than for 1, but
correspondingly more space is recovered, and the new data storage is the
same size once you delete the now-unused storage areas.
2. Is there any other way to solve this memory problem?
Schedule "compact", delete the old data storage.
If the updates are just a stream of additions, with no reads of the
database in between, write them to one big file (N-Triples or Turtle:
simply concatenate everything into a single file) and load that in one step.
You can also consider, instead of loading into Fuseki, using the bulk
loader tdb2.tdbloader to build the database offline, then putting it in
place and starting Fuseki. The bulk loader is significantly faster once
sizes get into the hundreds of millions of triples.
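A sketch of the big-file pattern (the endpoint URL, file name and triple
source are placeholders for your own pipeline): stream the pending
triples into one N-Triples file, then do a single load - or feed the same
file to tdb2.tdbloader offline.

import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.List;

import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.system.StreamRDF;
import org.apache.jena.riot.system.StreamRDFWriter;

public class BatchedLoad {
    public static void main(String[] args) throws Exception {
        // 1. Stream all pending triples into one N-Triples file.
        try (OutputStream out = new FileOutputStream("pending.nt")) {
            StreamRDF writer = StreamRDFWriter.getWriterStream(out, Lang.NTRIPLES);
            writer.start();
            for (Triple t : pendingTriples())              // placeholder for your data source
                writer.triple(t);
            writer.finish();
        }
        // 2. One load to the server - or build the database offline with
        //    tdb2.tdbloader from the same file and put it in place instead.
        try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/ds")) {
            conn.load("pending.nt");
        }
    }

    // Placeholder for wherever the 50k-triple batches currently come from.
    private static List<Triple> pendingTriples() {
        return List.of(Triple.create(
                NodeFactory.createURI("http://example/s"),
                NodeFactory.createURI("http://example/p"),
                NodeFactory.createLiteral("o")));
    }
}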
3. What existing strategies can be used to optimise storage while
writing data?
4. Is there any new development going on to use less storage for
write/update queries?
Just plans that need resources!
It would be nice to have server-side transactions spanning several
updates (which is beyond what the SPARQL protocol can do).
--
I've tried TDB with other storage systems (e.g. RocksDB) but the ability
to directly write the on-disk format is useful - it makes the bulk
loader work.
--
There are other issues as well in your use case.
It also depends on the data. If many triples have unique literals/URIs,
the node table is proportionately large.
Andy
Thanks,
Vinay Mahamuni