Re: [TDB2 & Fuseki 4.4.0] Huge tdb2 database disk size when performing incremental SPARQL update to endpoint.

Andy Seaborne Thu, 10 Feb 2022 09:49:24 -0800



On 10/02/2022 17:18, Dave Reynolds wrote:

While I can't help with the substance of this question ...
> Since, as far as I know, the latest fuseki (4.4.0) no longer supports TDB1
I don't think that's correct. While there are new features of TDB2 in the new release (the faster loader) I don't believe TDB1 has been deprecated, let alone dropped.


Yes - the only difference in Fuseki 4.4.0 is that there isn't a UI option.

All existing databases work, and TDB1 is still available when the dataset is set up from the server command line or given in a configuration file.


    Andy

Dave

On 10/02/2022 16:58, Cédric Viaccoz wrote:
Hello everyone,
I deploy a data processing pipeline at the University of Geneva in which a linked data platform, a Fedora Commons Repository (https://duraspace.org/fedora/) database, is loaded with researchers’ data, and its RDF metadata is then synchronized/uploaded to a Fuseki triplestore. The synchronization tool I use is the fcrepo-indexing-triplestore messaging application from the fcrepo-camel-toolbox (https://github.com/fcrepo-exts/fcrepo-camel-toolbox), basically an Apache Camel application designed to synchronize Fedora with an external triplestore.
Since, as far as I know, the latest Fuseki (4.4.0) no longer supports TDB1, I opted to migrate all the projects’ data to TDB2. This meant synchronizing all of the data from Fedora to Fuseki again, this time pointing the Camel app at TDB2-based endpoints.
However, I noticed that the volume of data stored by Fuseki in the “<FUSEKI_BASE>/databases” folder increased drastically with TDB2 compared to TDB1. For instance, a dataset which used to occupy 74 MB on TDB1 now weighs more than 11 GB! After some investigation I hypothesized that incremental insertion of triples into a TDB2 endpoint creates a bigger disk footprint than a single batch load (whereas in TDB1 both loading strategies lead to the same disk footprint).
It is quite tiresome to replicate my precise use case, because it requires deploying a Fedora repository and a Camel application, so instead I have attached to this mail a zip containing a small sample of our data as a Turtle file and a Python script that “emulates” the behavior of the data synchronization between Fedora and Fuseki. If you create a persistent TDB2 dataset on your local Fuseki listening on localhost port 3030, and name this dataset “gypso”, then running the Python script “triplestore_incremental_update.py” will, for each single triple from the “gypso.ttl” file, send an INSERT DATA {} SPARQL update to the Fuseki gypso/update endpoint. Please note that the Python script uses the package rdflib, so installing it beforehand with “pip install rdflib” might be necessary. On my Debian server, the resulting size of the database (which can be checked with the Linux command “du -h <FUSEKI_BASE>/databases/gypso/Data-001”) was 50 MB, whereas directly uploading the “gypso.ttl” file to the endpoint results in a size of only 538 KB, even though the data and query performance are identical after either loading strategy.
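The attached script is not reproduced in this archive. As a rough illustration of the one-update-per-triple loading described above, here is a minimal sketch. It is not the original “triplestore_incremental_update.py”: the function names are hypothetical, and the endpoint URL is assumed from the “gypso” dataset on localhost:3030 mentioned in the message.

```python
import urllib.request

# Assumed endpoint: the "gypso" dataset on a local Fuseki, as described in the message.
UPDATE_ENDPOINT = "http://localhost:3030/gypso/update"


def build_insert(s, p, o):
    """Wrap one triple (terms already in N-Triples syntax) in an INSERT DATA update."""
    return "INSERT DATA { %s %s %s . }" % (s, p, o)


def post_update(update, endpoint=UPDATE_ENDPOINT):
    """POST a single SPARQL Update request and return the HTTP status code."""
    req = urllib.request.Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


def incremental_load(ttl_path, endpoint=UPDATE_ENDPOINT):
    """Send every triple of a Turtle file as its own update request."""
    from rdflib import Graph  # third-party dependency: pip install rdflib

    g = Graph()
    g.parse(ttl_path, format="turtle")
    count = 0
    for s, p, o in g:
        # .n3() renders each term in a form usable inside INSERT DATA.
        post_update(build_insert(s.n3(), p.n3(), o.n3()), endpoint)
        count += 1
    return count


if __name__ == "__main__":
    print(incremental_load("gypso.ttl"), "updates sent")
```

Running it against an empty persistent TDB2 dataset and then checking the size of Data-001 with du should give a comparison point against a single upload of the same file.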
I know that as a workaround I could serialize all the data from our infrastructure into compact Turtle files and then upload them directly to TDB2 endpoints, but the data on the Fedora side gets updated regularly, so having the Camel application take care of automatic synchronization is necessary. Besides, this was not an issue at all on TDB1. Would anyone have an idea what might be the culprit behind this behavior?
If you need additional details: by looking at the individual file sizes under “Data-001”, I noticed that only the following files grow between the two loading strategies: “SPO.idn”, “nodes.idn”, “nodes.dat”, “OSP.dat”, “POS.idn”, “OSP.idn”, “POS.dat” and “SPO.dat”. I have also attached to this mail a screenshot showing a side-by-side comparison of the sizes of the database files, with gypso.ttl loaded incrementally on the left and as a single file upload on the right. I hope this gives a more low-level view of the issue.
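The screenshot is not included in this archive. A per-file comparison of the kind described above could be produced with a sketch like the following. The helper names are hypothetical, and it reports apparent file sizes (st_size), which can differ from what du shows for sparse files.

```python
import os


def file_sizes(directory):
    """Return {filename: size in bytes} for regular files directly inside directory."""
    sizes = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            sizes[entry.name] = entry.stat().st_size
    return sizes


def compare(dir_a, dir_b):
    """Return (name, size_a, size_b) for files whose sizes differ, largest growth in dir_a first."""
    a, b = file_sizes(dir_a), file_sizes(dir_b)
    rows = [
        (name, a.get(name, 0), b.get(name, 0))
        for name in sorted(set(a) | set(b))
        if a.get(name, 0) != b.get(name, 0)
    ]
    rows.sort(key=lambda r: r[1] - r[2], reverse=True)
    return rows


if __name__ == "__main__":
    # Hypothetical paths: two Data-001 folders, one per loading strategy.
    for name, inc, batch in compare("incremental/Data-001", "batch/Data-001"):
        print(f"{name:12} {inc:>12} {batch:>12}")
```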
Best regards,

Cédric Viaccoz
*Concepteur-Développeur au sein du domaine fonctionnel “Recherche et Information Scientifique (RISe)”*
Division du système et des technologies de l'information et de la communication / IT Services (DISTIC)
Université de Genève | 24 rue Général-Dufour | Bureau 338

Tél : +41 22 379 71 10
