Hi Vinay,

On 27/01/2022 06:14, Vinay Mahamuni wrote:
Hello,

I am using Apache Jena v4.3.2 + Fuseki + TDB2 persistent disk storage. I am using Jena RDFConnection to connect to the Fuseki server. I am sending 50k triples in one update. This is mostly new data (only a few triples will match existing data). The data are instances based on an ontology. Please have a look at the attached file showing how much disk space increases with each update. For 1.5 million triples, it took around 1.2GB. We want to store a few billion triples, so the bytes/triple ratio won't be good for our use case.

When I used the tdb2.tdbcompact tool, the data volume shrank to 400MB. But this extra step needs to be performed manually to optimise the storage.

It can be triggered by an admin process with e.g. "cron".

It doesn't have to be done very often unless your volume of 50k triple transactions is very high - in which case I suggest batching them into larger units.
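For reference, Fuseki exposes compaction through its HTTP administration protocol, so the cron job can be a simple POST to the server. A minimal sketch in Java (curl works equally well), assuming the admin endpoints are enabled and the dataset is /ds on localhost:3030 - adjust the names and any authentication for your setup:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Minimal sketch: ask a running Fuseki server to compact a TDB2 dataset. */
public class CompactViaAdmin {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        // POST /$/compact/{name} is the Fuseki admin operation for TDB2 compaction.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:3030/$/compact/ds"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Compact request returned HTTP " + response.statusCode());
    }
}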


My questions are as follows:

 1. Why do 30 update queries of 50k triples each take 3 times more
    memory than a single update query of 1500k triples? The data stored
    is the same, but the memory consumed is more in the first case.

TDB2 uses an MVCC/copy-on-write scheme for transaction isolation. It gives a very high isolation guarantee (serializable).

That means there is a per-transaction overhead, which is recovered by compact. The space can't be recovered at the time of the write because the old data may still be in use by read transactions seeing the pre-write state.

Compact is similar (though not identical) to PostgreSQL VACUUM.

Note that all additional space is recovered by "compact". The active directory is the highest-numbered "Data-NNNN". You can delete the earlier ones once the "compact" has finished, as logged in the server log. Or zip them and keep them as backups - Fuseki has released them and does not touch them again. Caution: on MS Windows, due to a long-standing (10+ year) Java JDK issue, the server has to be stopped and restarted to properly release the old files.

It doesn't matter whether it was one large write transaction or 100 smaller ones: the compacted database will be the same size. The database will have grown more for 100 writes than for 1, but more space is then recovered, and the data storage left in use is the same size once you delete the now-unused storage areas.
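If you ever run TDB2 embedded rather than behind Fuseki, the same compaction is available from Java. A rough sketch, assuming the database directory is /data/DB2 and nothing else has it open at the same time:

import org.apache.jena.sparql.core.DatasetGraph;
import org.apache.jena.tdb2.DatabaseMgr;

/** Rough sketch: compact an embedded TDB2 database from Java. */
public class CompactEmbedded {
    public static void main(String[] args) {
        // The location is the container directory holding the Data-NNNN generations.
        DatasetGraph dsg = DatabaseMgr.connectDatasetGraph("/data/DB2");
        // Writes the live state into a new, higher-numbered Data-NNNN directory
        // and switches to it; the older Data-NNNN directories can then be
        // deleted or archived.
        DatabaseMgr.compact(dsg);
    }
}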

 2. Is there any other way to solve this memory problem?

Schedule "compact", delete the old data storage.

If the updates are a stream of additions that do not read the database, write them to one big file (N-Triples or Turtle: just write everything, concatenated, to a single file) and load that in one go.
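As a sketch of that (file and endpoint names are just examples): because N-Triples is line-based, each incoming batch can simply be appended to one accumulating file, and the whole file sent to the server as a single load later.

import java.io.FileOutputStream;
import java.io.OutputStream;

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

/** Sketch: accumulate small batches into one N-Triples file, then load it in one go. */
public class BatchToOneFile {

    // Append one batch of new triples to the accumulating file.
    // N-Triples concatenates cleanly, so append mode is enough.
    static void appendBatch(Model batch, String file) throws Exception {
        try (OutputStream out = new FileOutputStream(file, true)) {
            RDFDataMgr.write(out, batch, Lang.NTRIPLES);
        }
    }

    // Later: send the accumulated file to Fuseki as one large write.
    static void loadAll(String file) {
        try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/ds")) {
            conn.load(file);
        }
    }
}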

You can also consider, instead of loading into Fuseki, using the bulk loader tdb2.tdbloader to build the database offline, putting it in place, and then starting Fuseki. The bulk loader is significantly faster when sizes get into the hundreds of millions of triples.
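For example (paths are illustrative; run this before Fuseki is started, or while it is stopped):

    tdb2.tdbloader --loc /data/DB2 data.nt.gz

then point the Fuseki dataset configuration at /data/DB2 and start the server.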

 3. What are the existing strategies that can be used to optimise the
    storage memory while writing data?
 4. Is there any new development going on to use less memory for the
    write/update query?

Just plans that need resources!

It would be nice to have server-side transactions spanning several updates (which is beyond what the SPARQL protocol can do).

--

I've tried TDB with other storage systems (e.g. RocksDB) but the ability to directly write the on-disk format is useful - it makes the bulk loader work.

--

There are other issues as well in your use case.

It also depends on the data: if many triples have unique literals/URIs, the node table is proportionately large.

    Andy



Thanks,
Vinay Mahamuni
