Hi,

Thank you for the answers. The lack of a UI option, plus the fact that I could not move the databases folder from 4.3.2 to 4.4.0 without a TDB lock exception being raised, led me to believe TDB1 was no longer supported. I am very glad to know that is not the case. I am currently syncing all of our datasets to TDB1-configured endpoints on Fuseki 4.4, so I am no longer blocked.
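For reference, what I mean by a TDB1-configured endpoint is roughly the assembler configuration below, passed to the server with --config (the dataset name, endpoint names and on-disk location are placeholders standing in for our actual setup):

    PREFIX :       <#>
    PREFIX fuseki: <http://jena.apache.org/fuseki#>
    PREFIX tdb:    <http://jena.hpl.hp.com/2008/tdb#>

    # Service exposing query, update and read-write graph store endpoints.
    :service a fuseki:Service ;
        fuseki:name     "gypso" ;
        fuseki:endpoint [ fuseki:operation fuseki:query  ; fuseki:name "query"  ] ;
        fuseki:endpoint [ fuseki:operation fuseki:update ; fuseki:name "update" ] ;
        fuseki:endpoint [ fuseki:operation fuseki:gsp-rw ; fuseki:name "data"   ] ;
        fuseki:dataset  :dataset .

    # TDB1 dataset backed by an on-disk location.
    :dataset a tdb:DatasetTDB ;
        tdb:location "databases/gypso" .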
However, I would not mind having the option to switch to TDB2 with our current architecture design in the future, so I am still very interested in understanding what is happening with the disk size inflation, if anyone has an idea. Also, is the TDB1 vs TDB2 situation akin to something like Python 2 vs Python 3? By that I mean: both are supported for the time being, but moving to the newer one is strongly encouraged because the older one might not receive support in the near or distant future?

Best regards,
Cédric

-----Original Message-----
From: Andy Seaborne <[email protected]>
Sent: Thursday, February 10, 2022 6:49 PM
To: [email protected]
Subject: Re: [TDB2 & Fuseki 4.4.0] Huge tdb2 database disk size when performing incremental SPARQL update to endpoint.

On 10/02/2022 17:18, Dave Reynolds wrote:
> While I can't help with the substance of this question ...
>
> > Since, as far as I know, the latest fuseki (4.4.0) no longer supports TDB1
>
> I don't think that's correct. While there are new features of TDB2 in the new release (the faster loader), I don't believe TDB1 has been deprecated, let alone dropped.

Yes - the only difference in Fuseki 4.4.0 is that there isn't a UI option.

All existing databases work; and when the server command line is used, or a configuration file is given, TDB1 is available.

    Andy

>
> Dave
>
> On 10/02/2022 16:58, Cédric Viaccoz wrote:
>> Hello everyone,
>>
>> I deploy a data treatment pipeline at the University of Geneva where a linked data platform, the Fedora Commons Repository (https://duraspace.org/fedora/), is loaded with researchers’ data, and then its RDF metadata is synchronized/uploaded to a Fuseki triplestore. The synchronization tool I use is the fcrepo-indexing-triplestore messaging application from the fcrepo-camel-toolbox (https://github.com/fcrepo-exts/fcrepo-camel-toolbox), basically an Apache Camel application designed to synchronize Fedora with an external triplestore.
>>
>> Since, as far as I know, the latest Fuseki (4.4.0) no longer supports TDB1, I opted to migrate all the projects’ data to TDB2, meaning synchronizing the whole of the data from Fedora to Fuseki again, this time making the camel app point to TDB2-based endpoints.
>>
>> However, I noticed that the data volume as stored in Fuseki, in the “<FUSEKI_BASE>/databases” folder, increased drastically in TDB2 compared to TDB1. For instance, a dataset which used to occupy 74Mb of data on TDB1 now weighs more than 11Gb! After some investigation I hypothesized that incremental insertion of triples into a TDB2 endpoint creates a bigger disk footprint than a single batch load (whereas in TDB1 both loading strategies lead to the same disk footprint).
>>
>> It is quite tiresome to replicate my precise use case, because it requires deploying a Fedora repository and a camel application, so instead I have attached to this mail a zip containing a small sample of our data as a Turtle file and a Python script that “emulates” the behavior of the data synchronization between Fedora and Fuseki. If you create a persistent TDB2 dataset named “gypso” on your local Fuseki listening on localhost port 3030, then running the Python script “triplestore_incremental_update.py” will, for each single triple from the “gypso.ttl” file, send an INSERT DATA {} SPARQL query to the Fuseki gypso/update endpoint. Please note that the Python script uses the package rdflib, so installing it beforehand through “pip install rdflib” might be necessary. On my Debian server, the resulting size of the database (which can be checked with the Linux command “du -h <FUSEKI_BASE>/databases/gypso/Data-001”) was 50Mb, whereas directly uploading the “gypso.ttl” file to the endpoint results in a size of only 538Kb, even though the data and query performance are identical after either loading strategy.
>>
>> I know that as a workaround I could serialize all the data from our infrastructure into compact Turtle files and then directly upload them to the TDB2 endpoints, but the data on the Fedora side gets updated regularly, so having the camel application take care of automatic synchronization is necessary; besides, this was not an issue at all on TDB1. Would anyone have an idea what might be the culprit behind this behavior?
>>
>> If you need additional details: by looking at the individual file sizes under “Data-001”, I noticed that only the following files grow between the two different loading strategies: “SPO.idn”, “nodes.idn”, “nodes.dat”, “OSP.dat”, “POS.idn”, “OSP.idn”, “POS.dat” and “SPO.dat”. I have also attached to this mail a screenshot displaying a side-by-side comparison of the size of the database files, with gypso.ttl loaded incrementally on the left and as a single file upload on the right. I hope this can give a more low-level view of the issue.
>>
>> Best regards,
>>
>> Cédric Viaccoz
>>
>> *Designer-Developer within the functional domain “Recherche et Information Scientifique (RISe)”*
>>
>> Division du système et des technologies de l'information et de la communication / IT Services (DISTIC)
>>
>> Université de Genève | 24 rue Général-Dufour | Bureau 338
>>
>> Tel: +41 22 379 71 10
>>
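P.S. For anyone who wants to reproduce this without deploying Fedora and the camel toolbox, a minimal sketch of the per-triple update pattern described above (assuming a local Fuseki at http://localhost:3030 with a TDB2 dataset named “gypso”, the sample data in “gypso.ttl”, and rdflib installed) would be:

    # Minimal sketch: send one INSERT DATA request per triple, emulating
    # the per-triple synchronization behaviour described above.
    from urllib import parse, request

    from rdflib import Graph

    UPDATE_ENDPOINT = "http://localhost:3030/gypso/update"

    graph = Graph()
    graph.parse("gypso.ttl", format="turtle")

    for s, p, o in graph:
        # Serialize the single triple as N-Triples so it can be embedded
        # in an INSERT DATA block.
        single = Graph()
        single.add((s, p, o))
        ntriples = single.serialize(format="nt")
        if isinstance(ntriples, bytes):  # rdflib < 6 returns bytes here
            ntriples = ntriples.decode("utf-8")
        update_query = "INSERT DATA { %s }" % ntriples
        body = parse.urlencode({"update": update_query}).encode("utf-8")
        # POST the update as a form-encoded "update" parameter (SPARQL 1.1 Protocol).
        request.urlopen(request.Request(UPDATE_ENDPOINT, data=body))

For comparison, loading the same file in a single request, e.g. curl -H "Content-Type: text/turtle" --data-binary @gypso.ttl http://localhost:3030/gypso/data, is the loading strategy that gives the much smaller on-disk footprint mentioned above.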
