Re: [TDB2 & Fuseki 4.4.0] Huge tdb2 database disk size when performing incremental SPARQL update to endpoint.

Andy Seaborne Fri, 11 Feb 2022 06:36:39 -0800



On 11/02/2022 11:08, Cédric Viaccoz wrote:

Hi,

Thank you for the answers. The lack of a UI option, plus the fact that I could
not move the databases folder from 4.3.2 to 4.4.0 without a TDB lock exception
being raised, led me to believe it was no longer supported. I am very glad to
know that isn't the case. I am currently syncing all of our datasets to
TDB1-configured endpoints on Fuseki 4.4, so I am no longer blocked.

However, I would not mind having the option in the future to switch to TDB2
with our current architecture design, so I'm still very interested in
understanding what is happening with the disk size inflation, if anyone has an
idea.


For a usage pattern of many small updates, TDB1 still has a role.

TDB2 transactions are based around copy-on-write in the indexes. The old space is not recycled: it may still be in use by transactions that haven't completed, and full tracking would be quite expensive. Instead, the space is reclaimed by running the TDB2 compact operation, which copies the in-use part of the database into a new Data-NNNN subdirectory. Only the highest-numbered one is in use. The others can be deleted, archived, compressed, moved elsewhere etc. - your choice.
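As a rough illustration of the Data-NNNN layout described above, here is a minimal Python sketch that finds the highest-numbered (in-use) subdirectory and lists the older ones with their sizes. The function names are made up for this example. The `compact_request` helper only builds a request for the `POST /$/compact/{name}` operation from the Fuseki administration protocol; it assumes the admin endpoints are enabled and reachable, so check the documentation for your Fuseki version before relying on it.

```python
import re
import urllib.request
from pathlib import Path

# TDB2 keeps each generation of the database in a "Data-NNNN" subdirectory.
DATA_DIR = re.compile(r"^Data-(\d+)$")


def dir_size(path):
    """Total size in bytes of all regular files below path."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def generations(db_dir):
    """Return (active, older): the highest-numbered Data-NNNN directory
    and the list of lower-numbered ones, in ascending order."""
    found = []
    for child in Path(db_dir).iterdir():
        match = DATA_DIR.match(child.name)
        if match and child.is_dir():
            found.append((int(match.group(1)), child))
    found.sort(key=lambda pair: pair[0])
    if not found:
        return None, []
    return found[-1][1], [path for _, path in found[:-1]]


def compact_request(server, dataset):
    """Build (but do not send) a POST to the admin compact operation.
    Assumption: the server exposes /$/compact/{name}."""
    url = f"{server.rstrip('/')}/$/compact/{dataset}"
    return urllib.request.Request(url, data=b"", method="POST")
```

Calling `generations()` on a dataset directory such as `<FUSEKI_BASE>/databases/gypso` and printing `dir_size()` for each entry in `older` shows how much space the superseded generations hold. Passing the result of `compact_request(...)` to `urllib.request.urlopen` would send the request; removing the old directories afterwards is left as a manual step here.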

It does mean TDB2 grows appreciably then contracts, whereas TDB1 is based on writing back to the same database. TDB1 has its own consequences. Writing back needs the database to have a few moments of being quiet, else the transaction log simply keeps growing. TDB1 transactions are also limited in size as they are temporarily retained in the Java heap as well as being in the transaction log. You can run out of heap.


    Andy


Also, is the TDB1 vs TDB2 situation akin to something like Python2 vs Python3 ? 
What I mean by that is that they are both supported for the time being, but it 
is highly encouraged to move to the newer one as the old one might not receive 
support in the near or distant future?

Best regards,
Cédric

-----Original Message-----
From: Andy Seaborne <[email protected]>
Sent: Thursday, February 10, 2022 6:49 PM
To: [email protected]
Subject: Re: [TDB2 & Fuseki 4.4.0] Huge tdb2 database disk size when performing 
incremental SPARQL update to endpoint.



On 10/02/2022 17:18, Dave Reynolds wrote:

While I can't help with the substance of this question ...

  > Since, as far as I know, the latest fuseki (4.4.0) no longer supports TDB1

I don't think that's correct. While there are new features of TDB2 in
the new release (the faster loader) I don't believe TDB 1 has been
deprecated let alone dropped.


Yes - the only difference in Fuseki 4.4.0 is that there isn't a UI option.

All existing databases work, and TDB1 is still available when the database is
set up from the server command line or from a configuration file.

      Andy


Dave

On 10/02/2022 16:58, Cédric Viaccoz wrote:

Hello everyone,

I deploy a data processing pipeline at the University of Geneva where
a linked data platform, a Fedora Commons Repository
(https://duraspace.org/fedora/) database, is loaded with researchers’
data, and then its RDF metadata is synchronized/uploaded to a Fuseki
triplestore. The synchronization tool I use is the
fcrepo-indexing-triplestore messaging application from the
fcrepo-camel-toolbox
(https://github.com/fcrepo-exts/fcrepo-camel-toolbox), basically an
Apache Camel application designed to synchronize Fedora with an
external triplestore.

Since, as far as I know, the latest Fuseki (4.4.0) no longer supports
TDB1, I opted to migrate all the projects’ data to TDB2, meaning
synchronizing the whole of the data from Fedora to Fuseki, this time
pointing the Camel app at TDB2-based endpoints.


However, I noticed that the data volume as stored by Fuseki in
the “<FUSEKI_BASE>/databases” folder increased drastically with TDB2
compared to TDB1. For instance, a dataset which used to occupy 74 MB
on TDB1 now weighs more than 11 GB! After some investigation I
hypothesized that incremental insertion of triples into a TDB2 endpoint
creates a bigger disk footprint than a single batch load (whereas in
TDB1 both loading strategies lead to the same disk footprint).

It is quite tiresome to replicate my precise use case, because it
requires deploying a Fedora repository and a Camel application, so
instead I attached to this mail a zip containing a small sample of
our data as a Turtle file and a Python script that “emulates” the
behavior of the data synchronization between Fedora and Fuseki. If
you create a persistent TDB2 dataset on your local Fuseki listening
on localhost port 3030, and name this dataset “gypso”, then running
the Python script “triplestore_incremental_update.py” will, for each
single triple from the “gypso.ttl” file, send an INSERT DATA {}
SPARQL update to the Fuseki gypso/update endpoint. Please note that
the Python script uses the rdflib package, so installing it beforehand
with “pip install rdflib” might be necessary. On my Debian
server, the resulting size of the database (which can be checked with
the Linux command “du -h <FUSEKI_BASE>/databases/gypso/Data-001”) was
50 MB, whereas directly uploading the “gypso.ttl” file to the
endpoint results in a size of only 538 KB, even though the data and
query performance are identical after either loading strategy.
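The attached script is not reproduced in this archive. The following is only a rough sketch of the same idea, not the original “triplestore_incremental_update.py”. To stay within the standard library it assumes the data is available as N-Triples (one triple per line) rather than parsing Turtle with rdflib, and the function names and the `batch_size` parameter are invented for this example.

```python
import urllib.parse
import urllib.request


def ntriples_statements(text):
    """Yield the non-empty, non-comment lines of an N-Triples document."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            yield line


def insert_data_updates(text, batch_size=1):
    """Yield one INSERT DATA update per group of batch_size triples.
    batch_size=1 mimics the one-update-per-triple synchronization."""
    batch = []
    for statement in ntriples_statements(text):
        batch.append(statement)
        if len(batch) == batch_size:
            yield "INSERT DATA {\n" + "\n".join(batch) + "\n}"
            batch = []
    if batch:  # leftover triples that did not fill a whole batch
        yield "INSERT DATA {\n" + "\n".join(batch) + "\n}"


def update_request(endpoint, update):
    """Build a SPARQL 1.1 Protocol update request (URL-encoded POST)."""
    body = urllib.parse.urlencode({"update": update}).encode("utf-8")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return urllib.request.Request(endpoint, data=body, headers=headers,
                                  method="POST")


def send_all(endpoint, text, batch_size=1):
    """Send every update in turn; each request is a separate update."""
    for update in insert_data_updates(text, batch_size):
        with urllib.request.urlopen(update_request(endpoint, update)) as resp:
            resp.read()
```

With a local server as described above, `send_all("http://localhost:3030/gypso/update", data)` would issue one request per triple. A larger `batch_size` sends fewer, bigger updates, which may be a useful variable when comparing disk footprints; no measurements were made for this sketch.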

I know that as a workaround I could serialize all the data from our
infrastructure into compact Turtle files and then upload them directly
to TDB2 endpoints, but the data on the Fedora side gets updated
regularly, so having the Camel application take care of the
automatic synchronization is necessary. Besides, this was not an issue
at all on TDB1. Would anyone have an idea what might be the culprit
behind this behavior?

If you need additional details: by looking at the individual file
sizes under “Data-001”, I noticed that only the following files grow
between the two loading strategies: “SPO.idn”,
“nodes.idn”, “nodes.dat”, “OSP.dat”, “POS.idn”, “OSP.idn”, “POS.dat” and
“SPO.dat”.
I have also attached to this mail a screenshot displaying a
side-by-side comparison of the sizes of the database files, with
gypso.ttl loaded incrementally on the left and as a single file
upload on the right. I hope this gives a more low-level
view of the issue.

Best regards,

Cédric Viaccoz

*Designer-Developer in the functional area “Recherche et
Information Scientifique (RISe)”*

Division du système et des technologies de l'information et de la
communication/ IT Services (DISTIC)

Université de Genève | 24 rue Général-Dufour | Bureau 338

Tél : +41 22 379 71 10
