Hello everyone,

I maintain a data processing pipeline at the University of Geneva in which a 
linked data platform, a Fedora Commons Repository 
(https://duraspace.org/fedora/) database, is loaded with researchers' data, 
and its RDF metadata is then synchronized/uploaded to a Fuseki triplestore. 
The synchronization tool I use is the fcrepo-indexing-triplestore messaging 
application from the fcrepo-camel-toolbox 
(https://github.com/fcrepo-exts/fcrepo-camel-toolbox), basically an Apache 
Camel application designed to synchronize Fedora with an external 
triplestore.
Since, as far as I know, the latest Fuseki (4.4.0) no longer supports TDB1, 
I opted to migrate all the projects' data to TDB2, which meant synchronizing 
the whole of the data from Fedora to Fuseki again, this time pointing the 
Camel app at TDB2-based endpoints.

However, I noticed that the data volume as stored by Fuseki in the 
"<FUSEKI_BASE>/databases" folder increased drastically with TDB2 compared to 
TDB1. For instance, a dataset which used to occupy 74 MB on TDB1 now weighs 
more than 11 GB! After some investigation I hypothesized that incremental 
insertion of triples into a TDB2 endpoint creates a bigger disk footprint 
than a single batch load (whereas in TDB1 both loading strategies lead to 
the same disk footprint).

It is quite tiresome to replicate my precise use case, because it requires 
deploying a Fedora repository and a Camel application, so instead I have 
attached to this mail a zip containing a small sample of our data as a 
Turtle file and a Python script that "emulates" the behavior of the data 
synchronization between Fedora and Fuseki. If you create a persistent TDB2 
dataset named "gypso" on a local Fuseki listening on localhost port 3030, 
then running the Python script "triplestore_incremental_update.py" will, for 
each single triple in the "gypso.ttl" file, send an INSERT DATA {} SPARQL 
query to the Fuseki gypso/update endpoint. Please note that the Python 
script uses the rdflib package, so installing it beforehand with "pip 
install rdflib" might be necessary. On my Debian server, the resulting size 
of the database (which can be checked with the Linux command "du -h 
<FUSEKI_BASE>/databases/gypso/Data-001") was 50 MB, whereas directly 
uploading the "gypso.ttl" file to the endpoint results in a size of only 
538 KB, even though the data and query performance are identical after 
either loading strategy.
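For what it is worth, the attached script boils down to roughly the 
following (a minimal sketch rather than the exact attached code; it assumes 
a local Fuseki at http://localhost:3030 with a TDB2 dataset named "gypso", 
the gypso.ttl file in the working directory, and uses the requests package 
in addition to rdflib):

    import requests
    from rdflib import Graph

    FUSEKI_UPDATE = "http://localhost:3030/gypso/update"

    g = Graph()
    g.parse("gypso.ttl", format="turtle")

    # Send one INSERT DATA request per triple, mimicking how the Camel
    # app pushes individual updates from Fedora to Fuseki (blank nodes
    # are not treated specially in this sketch).
    for s, p, o in g:
        query = "INSERT DATA { %s %s %s . }" % (s.n3(), p.n3(), o.n3())
        resp = requests.post(FUSEKI_UPDATE, data={"update": query})
        resp.raise_for_status()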

I know that as a workaround I could serialize all the data from our 
infrastructure into compact Turtle files and then directly upload them to 
the TDB2 endpoints, but the data on the Fedora side gets updated regularly, 
so having the Camel application take care of the automatic synchronization 
is necessary; besides, this was not an issue at all with TDB1. Would anyone 
have an idea what might be the culprit behind this behavior?
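For comparison, the single-file upload that produces the small database is 
essentially the following (again only a sketch under the same assumptions; 
it posts the Turtle file to the dataset's Graph Store Protocol "data" 
endpoint in one request):

    import requests

    # Bulk-load gypso.ttl in a single request into the default graph of
    # the "gypso" dataset.
    with open("gypso.ttl", "rb") as f:
        resp = requests.post(
            "http://localhost:3030/gypso/data",
            data=f,
            headers={"Content-Type": "text/turtle"},
        )
    resp.raise_for_status()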

If you need additional details: by looking at the individual file sizes 
under "Data-001", I noticed that only the following files grow between the 
two loading strategies: "SPO.idn", "nodes.idn", "nodes.dat", "OSP.dat", 
"POS.idn", "OSP.idn", "POS.dat" and "SPO.dat". I have also attached to this 
mail a screenshot showing a side-by-side comparison of the sizes of the 
database files, with gypso.ttl loaded incrementally on the left and as a 
single file upload on the right. I hope this gives a more low-level view of 
the issue.
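In case it is easier to work with than the screenshot, the per-file sizes 
can be listed with something like this (a sketch; replace the <FUSEKI_BASE> 
placeholder with your actual path):

    import os

    # Print the size of every file under the TDB2 Data-001 directory;
    # the path below is a placeholder, not a real location.
    data_dir = "<FUSEKI_BASE>/databases/gypso/Data-001"
    for entry in sorted(os.scandir(data_dir), key=lambda e: e.name):
        print(f"{entry.stat().st_size / 1024:10.1f} KB  {entry.name}")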

Best regards,
Cédric Viaccoz
Designer-Developer within the functional domain "Recherche et Information 
Scientifique (RISe)"
Division du système et des technologies de l'information et de la 
communication/ IT Services (DISTIC)
Université de Genève | 24 rue Général-Dufour | Bureau 338
Tel.: +41 22 379 71 10

<<attachment: triplestore_incremental_updates.zip>>
