On 10/02/2022 17:18, Dave Reynolds wrote:
While I can't help with the substance of this question ...
> Since, as far as I know, the latest fuseki (4.4.0) no longer supports
TDB1
I don't think that's correct. While there are new features of TDB2 in
the new release (the faster loader) I don't believe TDB 1 has been
deprecated let alone dropped.
Yes - the only difference in Fuseki 4.4.0 is that there isn't a UI option.
All existing databases still work, and TDB1 is available when the dataset
is given on the server command line or in a configuration file.
Andy
Dave
On 10/02/2022 16:58, Cédric Viaccoz wrote:
Hello everyone,
I run a data processing pipeline at the University of Geneva in which a
linked data platform, a Fedora Commons Repository
(https://duraspace.org/fedora/), is loaded with researchers’ data, and
its RDF metadata is then synchronized to a Fuseki triplestore. The
synchronization tool I use is the fcrepo-indexing-triplestore messaging
application from the fcrepo-camel-toolbox
(https://github.com/fcrepo-exts/fcrepo-camel-toolbox), basically an
Apache Camel application designed to synchronize Fedora with an
external triplestore.
Since, as far as I know, the latest fuseki (4.4.0) no longer supports
TDB1, I opted to migrate all the projects’ data to TDB2, which meant
resynchronizing all of the data from Fedora to Fuseki, this time with
the Camel app pointing to TDB2-based endpoints.
However, I noticed that the data volume stored by Fuseki in the
“<FUSEKI_BASE>/databases” folder increased drastically with TDB2
compared to TDB1. For instance, a dataset which used to occupy 74 MB
on TDB1 now takes up more than 11 GB! After some investigation, I
hypothesized that incremental insertion of triples into a TDB2 endpoint
creates a bigger disk footprint than a single batch load (whereas in
TDB1 both loading strategies lead to the same disk footprint).
It is quite tedious to replicate my precise use case, because it
requires deploying a Fedora repository and a Camel application, so
instead I have attached to this mail a zip containing a small sample of
our data as a Turtle file and a Python script that “emulates” the
behavior of the data synchronization between Fedora and Fuseki. If you
create a persistent TDB2 dataset named “gypso” on a local Fuseki
listening on localhost port 3030, then running the Python script
“triplestore_incremental_update.py” will, for each single triple from
the “gypso.ttl” file, send an INSERT DATA {} SPARQL update to the
Fuseki gypso/update endpoint. Please note that the Python script uses
the rdflib package, so installing it beforehand with “pip install
rdflib” might be necessary. On my Debian server, the resulting size of
the database (which can be checked with the Linux command
“du -h <FUSEKI_BASE>/databases/gypso/Data-001”) was 50 MB, whereas
directly uploading the “gypso.ttl” file to the endpoint results in a
size of only 538 KB, even though the data and the query performance are
identical after either loading strategy.
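For readers without the attachment, the incremental load described above
can be approximated as follows. This is only a rough sketch of the idea,
not the attached script: the function names and the use of urllib are
assumptions, while the endpoint URL, the dataset name “gypso” and the
file “gypso.ttl” are taken from the description above.

```python
import sys
import urllib.request

# Assumed endpoint: dataset "gypso" on a local Fuseki, port 3030.
UPDATE_ENDPOINT = "http://localhost:3030/gypso/update"


def build_insert(subject: str, predicate: str, obj: str) -> str:
    """Wrap one triple (terms already in N3 syntax) in an INSERT DATA update."""
    return f"INSERT DATA {{ {subject} {predicate} {obj} . }}"


def send_update(update: str, endpoint: str = UPDATE_ENDPOINT) -> int:
    """POST one SPARQL update per call and return the HTTP status code."""
    request = urllib.request.Request(
        endpoint,
        data=update.encode("utf-8"),
        headers={"Content-Type": "application/sparql-update; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def main(ttl_path: str) -> None:
    # rdflib is imported lazily so the helpers above work without it.
    from rdflib import Graph

    graph = Graph()
    graph.parse(ttl_path, format="turtle")
    # One HTTP request per triple. Caveat: blank node labels are not
    # shared across separate requests, so this sketch only suits data
    # without blank nodes.
    for s, p, o in graph:
        send_update(build_insert(s.n3(), p.n3(), o.n3()))


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

Saved under a hypothetical name such as “incremental_sketch.py”, it
would be run as “python incremental_sketch.py gypso.ttl” against a
running Fuseki.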
I know that as a workaround I could serialize all the data from our
infrastructure into compact Turtle files and then upload them directly
to TDB2 endpoints, but the data on the Fedora side gets updated
regularly, so having the Camel application take care of automatic
synchronization is necessary. Besides, this was not an issue at all
on TDB1. Would anyone have an idea what might be the culprit behind
this behavior?
In case additional details help: looking at the individual file sizes
under “Data-001”, I noticed that only the following files grow between
the two loading strategies: “SPO.idn”, “nodes.idn”, “nodes.dat”,
“OSP.dat”, “POS.idn”, “OSP.idn”, “POS.dat” and “SPO.dat”. I have also
attached a screenshot showing a side-by-side comparison of the sizes of
the database files, with gypso.ttl loaded incrementally on the left and
as a single file upload on the right. I hope this gives a more
low-level view of the issue.
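Since the screenshot is not visible in the plain-text archive, a
per-file comparison like the one described above could be reproduced
with a small script along these lines. It is a minimal sketch: the two
function names are made up for illustration, and it reports apparent
file sizes (st_size), which can differ from what “du” shows if the
files are sparse.

```python
from pathlib import Path


def file_sizes(directory: str) -> dict[str, int]:
    """Map each regular file directly under `directory` to its size in bytes."""
    return {
        p.name: p.stat().st_size
        for p in Path(directory).iterdir()
        if p.is_file()
    }


def growth(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return the byte difference for every file whose size differs."""
    names = set(before) | set(after)
    diffs = {n: after.get(n, 0) - before.get(n, 0) for n in names}
    return {n: d for n, d in sorted(diffs.items()) if d != 0}
```

Calling growth(file_sizes(a), file_sizes(b)) on the “Data-001” folders
of two copies of the dataset, one loaded by a single upload and one
loaded incrementally, lists only the files whose sizes differ.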
Best regards,
Cédric Viaccoz
*Designer-Developer in the “Recherche et Information Scientifique
(RISe)” functional domain*
Division du système et des technologies de l'information et de la
communication/ IT Services (DISTIC)
Université de Genève | 24 rue Général-Dufour | Bureau 338
Tél : +41 22 379 71 10