Hi Jan,

You don't say whether this is TDB1 or TDB2. They behave differently: both grow, though to varying degrees, TDB1 somewhat less quickly than TDB2. What's more with small databases like 40M,

TDB2 has a compaction function which, in effect, does the dump/restore but faster and with no downtime for read operations. Unfortunately this isn't available in Fuseki so a stop-compact-restart is necessary (it can be done - the code isn't written).

TDB2 uses an MVCC data structure: when you add data, the new data is added to copied blocks in the index. This has the advantage of allowing arbitrarily large transactions, including bulk loads, while the server is running. The loads are faster as well because some data is written to disk asynchronously by the OS while the update is in progress. Outstanding read transactions continue reading the old data. Indeed, for TDB2, it would be possible to run a query on any previous state of the database: it never forgets until a compaction happens.
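
To make the copy-on-write growth concrete, here is a minimal toy sketch in Python. It is not TDB2's actual implementation: the class `CowStore`, its `LEAF_CAPACITY` and the two-level layout are all hypothetical, chosen only to show why an append-only index grows with every commit, why older states stay readable, and why a compaction shrinks it.

```python
class CowStore:
    """Toy append-only store: blocks are never overwritten, only appended."""

    LEAF_CAPACITY = 4  # hypothetical and tiny, so the effect is easy to see

    def __init__(self):
        self.blocks = []  # every block ever written (stands in for the file)
        self.roots = []   # one root block id per committed version
        empty_leaf = self._write(())
        self.roots.append(self._write((empty_leaf,)))

    def _write(self, content):
        """Append a block and return its id."""
        self.blocks.append(content)
        return len(self.blocks) - 1

    def add(self, key):
        """Commit one key as its own transaction, copying the touched blocks."""
        leaf_ids = list(self.blocks[self.roots[-1]])
        last_leaf = self.blocks[leaf_ids[-1]]
        if len(last_leaf) < self.LEAF_CAPACITY:
            # Copy the leaf with the new key; the old leaf stays in place.
            leaf_ids[-1] = self._write(last_leaf + (key,))
        else:
            leaf_ids.append(self._write((key,)))
        # A new root points at the new leaf; old roots remain readable.
        self.roots.append(self._write(tuple(leaf_ids)))

    def keys(self, version=-1):
        """Read the keys visible in a given committed version."""
        root = self.blocks[self.roots[version]]
        return [k for leaf_id in root for k in self.blocks[leaf_id]]

    def compact(self):
        """Build a new store holding only blocks reachable from the latest root."""
        fresh = CowStore.__new__(CowStore)
        fresh.blocks, fresh.roots = [], []
        live = tuple(fresh._write(self.blocks[i])
                     for i in self.blocks[self.roots[-1]])
        fresh.roots.append(fresh._write(live))
        return fresh


store = CowStore()
for k in range(10):
    store.add(k)

compacted = store.compact()
print(len(store.blocks), len(compacted.blocks))  # prints: 22 4
print(store.keys(version=3))                     # prints: [0, 1, 2]
```

In this toy, ten one-key commits leave 22 blocks behind, of which only 4 are reachable from the latest state. Many small additive commits are the pattern that leaves the most superseded blocks, and the compacted copy no longer has the older versions.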

TDB1 grows but more slowly. It does not always reuse index blocks freed up when the B+Tree blocks are split. TDB1 does not write a transaction's changes back until after the transaction has finished, which limits the transaction size.

At the moment, for both cases, offline repacking is necessary.

    Andy

On 24/04/2020 11:07, Jan Šmucr wrote:
Hello.

I'm building a file processing workflow monitoring system based on Jena Fuseki. 
The goal for this system is to be purely additive. Individual events are 
pieces of knowledge about each of the jobs eventually connected via various 
identifiers and references. Finally I can search for an event and with basic 
knowledge of the scheme I can rebuild a graph representing the whole processing 
job, and display it to the customer. It seems to work and I'm happy about the 
whole idea.

There's however one thing I'd like to solve and that is the incredible amount of space the 
triplestore consumes. Currently there are about 40M triples (2 months of traffic approximately) 
and if unmaintained, the amount of disk space the database consumes is huge. The maintenance 
process is to stop Fuseki -> dump database -> backup -> delete the old database -> 
load the dump -> start Fuseki. At this point the database is at most 10 % of what it was 
before the maintenance. Then it grows back and even more with new data being added.

Note that approx. half of triples in the inserts might already be in the 
database. Example:

### First event

e:MyReceiveEvent
     a e:Event ;
     a e:Receive ;
     e:subjectMessage e:MyMessage .
e:MyMessage
     a e:Message ;
     e:guid "MyMessage"^^xsd:string .

### Second event

e:MySendEvent
     a e:Event ;
     a e:Send ;
     e:subjectMessage e:MyMessage .
e:MyMessage
     a e:Message ;
     e:guid "MyMessage"^^xsd:string .

This is because I need to use inserts only and I don't know anything about the 
triplestore contents at the point I emit the event. My testing however didn't 
show any database growth when I did this on purpose.
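
The observation about repeated triples matches RDF's set semantics: a graph is a set of triples, so inserting a triple that is already present does not add another one. A small sketch with plain Python sets, reading the `e:guid` line as a property of the message (an assumption) and using hypothetical string tuples in place of real RDF terms. It only models the logical triple count and says nothing about how TDB stores data on disk.

```python
# Triples modelled as (subject, predicate, object) tuples; a graph is a set.
RDF_TYPE = "rdf:type"

first_event = {
    ("e:MyReceiveEvent", RDF_TYPE, "e:Event"),
    ("e:MyReceiveEvent", RDF_TYPE, "e:Receive"),
    ("e:MyReceiveEvent", "e:subjectMessage", "e:MyMessage"),
    ("e:MyMessage", RDF_TYPE, "e:Message"),
    ("e:MyMessage", "e:guid", "MyMessage"),
}

second_event = {
    ("e:MySendEvent", RDF_TYPE, "e:Event"),
    ("e:MySendEvent", RDF_TYPE, "e:Send"),
    ("e:MySendEvent", "e:subjectMessage", "e:MyMessage"),
    ("e:MyMessage", RDF_TYPE, "e:Message"),
    ("e:MyMessage", "e:guid", "MyMessage"),
}

graph = set()
graph |= first_event   # insert-only: nothing is ever removed
graph |= second_event

repeated = first_event & second_event
print(len(first_event) + len(second_event), len(graph), len(repeated))
# prints: 10 8 2
```

Ten triples are sent, but the graph holds eight, because the two `e:MyMessage` triples arrive twice.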

How to fight this? Is it because of all the inserts and some related 
triplestore designs? What steps should I perform to optimize the store to suit 
the scenario?

Thank you very much for your responses.

Jan

