Re: TDB2 store grows unexpectedly

Andy Seaborne Fri, 24 Apr 2020 13:28:41 -0700

Hi Jan,

You don't say whether this is TDB1 or TDB2 - they behave differently; both grow, though by varying degrees, TDB1 somewhat less quickly than TDB2. What's more, with small databases like 40M,

TDB2 has a compaction function which, in effect, does the dump/restore but faster and with no downtime for read operations. Unfortunately this isn't available in Fuseki so a stop-compact-restart is necessary (it can be done - the code isn't written).

TDB2 uses an MVCC data structure - when you add data, the new data is added to copied blocks in the index. This has the advantage of allowing arbitrarily large transactions, including bulk loads, while running. The loads are faster as well because some data is written to disk asynchronously by the OS while the update is in progress. Outstanding read transactions continue reading the old data. Indeed, for TDB2, it would be possible to run a query on any previous state of the database - it never forgets until a compaction happens.
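To make the growth pattern concrete, here is a toy copy-on-write model in Python. It is only a sketch of the general idea (writes go to fresh blocks, old versions stay readable, compaction drops what the latest version no longer references); the class and method names are invented for illustration and this is not how TDB2 actually lays out its indexes.

```python
class CopyOnWriteStore:
    """Toy append-only store: every write goes to a fresh block (hypothetical model)."""

    def __init__(self):
        self.blocks = []     # physical blocks, never overwritten in place
        self.roots = [{}]    # one root per committed version: key -> block index

    def write(self, key, value):
        """Commit a new version; earlier versions stay readable."""
        self.blocks.append(value)            # new data lands in a new block
        new_root = dict(self.roots[-1])      # copy the previous root
        new_root[key] = len(self.blocks) - 1
        self.roots.append(new_root)

    def read(self, key, version=-1):
        """Read a key as of any committed version (default: the latest)."""
        return self.blocks[self.roots[version][key]]

    def compact(self):
        """Keep only the blocks reachable from the latest version."""
        new_blocks, new_root = [], {}
        for key, idx in self.roots[-1].items():
            new_root[key] = len(new_blocks)
            new_blocks.append(self.blocks[idx])
        self.blocks, self.roots = new_blocks, [new_root]


store = CopyOnWriteStore()
for value in ("v1", "v2", "v3"):
    store.write("s1", value)

blocks_before = len(store.blocks)          # one block per write
old_value = store.read("s1", version=1)    # state after the first commit
store.compact()
blocks_after = len(store.blocks)           # only the live block remains
```

In this model the store holds three blocks for one live value until `compact()` runs, which mirrors the "it never forgets until a compaction happens" behaviour described above.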

TDB1 grows but more slowly. It does not always reuse index blocks freed up when the B+Tree blocks are split. TDB1 does not finish and write back transactions until after the transaction has finished, which limits the transaction size.


At the moment, for both cases, offline repacking is necessary.
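For the TDB2 case, the stop-compact-restart cycle could be scripted roughly as follows. This is a sketch under assumptions: `tdb2.tdbcompact` is the TDB2 command-line compaction tool from the Jena distribution and is assumed to be on the PATH; `stop_server` and `start_server` are hypothetical callables standing in for however Fuseki is managed in a given deployment.

```python
import subprocess


def compact_offline(db_dir, stop_server, start_server, run=subprocess.run):
    """Stop the server, compact the TDB2 database at db_dir, then restart.

    stop_server / start_server are placeholders supplied by the caller
    (for example wrappers around a service manager); they are not Jena APIs.
    """
    stop_server()
    try:
        # Run the TDB2 compaction tool against the database directory.
        run(["tdb2.tdbcompact", f"--loc={db_dir}"], check=True)
    finally:
        # Bring the server back even if compaction failed.
        start_server()


# Example call (paths and service commands are assumptions):
# compact_offline(
#     "/fuseki/databases/DB2",
#     stop_server=lambda: subprocess.run(["systemctl", "stop", "fuseki"], check=True),
#     start_server=lambda: subprocess.run(["systemctl", "start", "fuseki"], check=True),
# )
```

For TDB1 there is no compaction tool, so the equivalent offline step remains the dump and reload described in the original message.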

    Andy

On 24/04/2020 11:07, Jan Šmucr wrote:

Hello.

I'm building a file processing workflow monitoring system based on Jena Fuseki.
The goal for this system is to be purely additive. Individual events are
pieces of knowledge about each of the jobs, eventually connected via various
identifiers and references. Finally I can search for an event and with basic
knowledge of the scheme I can rebuild a graph representing the whole processing
job, and display it to the customer. It seems to work and I'm happy about the
whole idea.

There's however one thing I'd like to solve and that is the incredible amount of space the
triplestore consumes. Currently there's about 40M triples (2 months of traffic approximately)
and if unmaintained, the amount of disk space the database consumes is huge. The maintenance
process is to stop Fuseki -> dump database -> backup -> delete the old database ->
load the dump -> start Fuseki. At this point the database is at most 10 % of what it was
before the maintenance. Then it grows back and even more with new data being added.

Note that approx. half of triples in the inserts might already be in the
database. Example:

### First event

e:MyReceiveEvent
    a e:Event ;
    a e:Receive ;
    e:subjectMessage e:MyMessage .
e:MyMessage
    a e:Message ;
    e:guid "MyMessage"^^xsd:string .

### Second event

e:MySendEvent
    a e:Event ;
    a e:Send ;
    e:subjectMessage e:MyMessage .
e:MyMessage
    a e:Message ;
    e:guid "MyMessage"^^xsd:string .

This is because I need to use inserts only and I don't know anything about the
triplestore contents at the point I emit the event. My testing however didn't
show any database growth when I did this on purpose.
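As a small illustration of why repeated triples do not increase the triple count: an RDF graph is a set of triples, so inserting a triple that is already present leaves the graph unchanged. The sketch below models the two events above as plain Python tuples, assuming the last predicate on e:MyMessage is `e:guid`. It says nothing about how much disk space a particular store uses for such inserts.

```python
# Triples as (subject, predicate, object) tuples; "rdf:type" stands for Turtle's "a".
first_event = {
    ("e:MyReceiveEvent", "rdf:type", "e:Event"),
    ("e:MyReceiveEvent", "rdf:type", "e:Receive"),
    ("e:MyReceiveEvent", "e:subjectMessage", "e:MyMessage"),
    ("e:MyMessage", "rdf:type", "e:Message"),
    ("e:MyMessage", "e:guid", '"MyMessage"^^xsd:string'),
}
second_event = {
    ("e:MySendEvent", "rdf:type", "e:Event"),
    ("e:MySendEvent", "rdf:type", "e:Send"),
    ("e:MySendEvent", "e:subjectMessage", "e:MyMessage"),
    ("e:MyMessage", "rdf:type", "e:Message"),
    ("e:MyMessage", "e:guid", '"MyMessage"^^xsd:string'),
}

graph = set()
graph |= first_event     # first insert adds five triples
graph |= second_event    # second insert adds only the three new ones

inserted = len(first_event) + len(second_event)
stored = len(graph)
duplicates = inserted - stored
```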

How to fight this? Is it because of all the inserts and some related
triplestore designs? What steps should I perform to optimize the store to suit
the scenario?

Thank you very much for your responses.

Jan
