Hi Jan,
You don't say whether this is TDB1 or TDB2 - they behave differently.
Both grow, though to varying degrees, TDB1 somewhat less quickly than
TDB2. What's more with small databases like 40M,
TDB2 has a compaction function which, in effect, does the dump/restore
but faster and with no downtime for read operations. Unfortunately this
isn't available in Fuseki, so a stop-compact-restart is necessary (it could
be done in Fuseki - the code just isn't written).
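As a rough sketch of the stop-compact-restart cycle, assuming the Jena command line tools are on the PATH: tdb2.tdbcompact is the TDB2 compaction command, while the database path and the systemctl stop/start commands are hypothetical placeholders for however Fuseki is run in your deployment. The example only records the commands (a dry run); drop the run= argument to execute them.

```python
import subprocess

def stop_compact_restart(db_dir, stop_cmd, start_cmd, run=subprocess.run):
    """Run stop -> compact -> restart, aborting on the first failing step."""
    steps = [
        list(stop_cmd),                        # deployment-specific (assumed)
        ["tdb2.tdbcompact", "--loc", db_dir],  # Jena's TDB2 compaction command
        list(start_cmd),                       # deployment-specific (assumed)
    ]
    for argv in steps:
        run(argv, check=True)  # with subprocess.run, raises on a non-zero exit
    return steps

# Dry run: record the commands instead of executing them.
recorded = []
stop_compact_restart(
    "/fuseki/databases/events",          # hypothetical database location
    ["systemctl", "stop", "fuseki"],     # hypothetical service name
    ["systemctl", "start", "fuseki"],
    run=lambda argv, check: recorded.append(argv),
)
for argv in recorded:
    print(" ".join(argv))
```

Note that if the compaction step fails, this sketch leaves Fuseki stopped, so a real script would want to decide whether to restart anyway.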
TDB2 uses MVCC data structures - when you add data, the new data is added to
copied blocks in the index. This has the advantage of allowing arbitrarily
large transactions, including bulk loads, while running. The loads are faster
as well because some data is written to disk asynchronously by the OS while
the update is in progress. Outstanding read transactions continue reading
the old data. Indeed, for TDB2, it would be possible to run a query on any
previous state of the database - it never forgets until a compaction
happens.
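To illustrate why copy-on-write storage grows and what compaction reclaims, here is a toy model. It is not TDB2's actual implementation: the class CopyOnWriteStore and its methods are made up for the illustration, and a Python list stands in for the index file on disk.

```python
class CopyOnWriteStore:
    """Toy append-only store: a commit writes new copies, never overwrites."""

    def __init__(self):
        self.blocks = []      # every block ever written (stands in for the file)
        self.versions = [{}]  # version number -> {block name: index in blocks}

    def commit(self, updates):
        root = dict(self.versions[-1])  # new root; old roots stay for readers
        for name, content in updates.items():
            self.blocks.append(content)         # write a fresh copy
            root[name] = len(self.blocks) - 1   # point the new root at it
        self.versions.append(root)
        return len(self.versions) - 1           # the new version number

    def read(self, name, version=-1):
        return self.blocks[self.versions[version][name]]

    def compact(self):
        """Keep only the blocks reachable from the latest version."""
        new_blocks, new_root = [], {}
        for name, index in self.versions[-1].items():
            new_blocks.append(self.blocks[index])
            new_root[name] = len(new_blocks) - 1
        self.blocks, self.versions = new_blocks, [new_root]

store = CopyOnWriteStore()
v1 = store.commit({"block-A": ["t1"]})
v2 = store.commit({"block-A": ["t1", "t2"]})
v3 = store.commit({"block-A": ["t1", "t2", "t3"]})

size_before = len(store.blocks)      # three copies of block-A are stored
old_state = store.read("block-A", v1)  # the first state is still readable
store.compact()
size_after = len(store.blocks)       # only the latest copy survives
```

Three small commits to the same block leave three copies behind, and any earlier state can still be read; after compact() only the latest copy remains and the older states are gone.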
TDB1 grows but more slowly. It does not always reuse index blocks freed
up when the B+Tree blocks are split. TDB1 does not write back a
transaction's changes until after the transaction has finished, which limits
the transaction size.
At the moment, for both cases, offline repacking is necessary.
Andy
On 24/04/2020 11:07, Jan Šmucr wrote:
Hello.
I'm building a file processing workflow monitoring system based on Jena Fuseki.
The goal for this system is to be purely additive. Individual events are
pieces of knowledge about each of the jobs, eventually connected via various
identifiers and references. Finally I can search for an event and with basic
knowledge of the scheme I can rebuild a graph representing the whole processing
job, and display it to the customer. It seems to work and I'm happy about the
whole idea.
There's however one thing I'd like to solve and that is the incredible amount of space the
triplestore consumes. Currently there's about 40M triples (2 months of traffic approximately)
and if unmaintained, the amount of disk space the database consumes is huge. The maintenance
process is to stop Fuseki -> dump database -> backup -> delete the old database ->
load the dump -> start Fuseki. At this point the database is at most 10 % of what it was
before the maintenance. Then it grows back and even more with new data being added.
Note that approx. half of triples in the inserts might already be in the
database. Example:
### First event
e:MyReceiveEvent
    a e:Event ;
    a e:Receive ;
    e:subjectMessage e:MyMessage .
e:MyMessage
    a e:Message ;
    e:guid "MyMessage"^^xsd:string .
### Second event
e:MySendEvent
    a e:Event ;
    a e:Send ;
    e:subjectMessage e:MyMessage .
e:MyMessage
    a e:Message ;
    e:guid "MyMessage"^^xsd:string .
This is because I need to use inserts only and I don't know anything about the
triplestore contents at the point I emit the event. My testing however didn't
show any database growth when I did this on purpose.
How to fight this? Is it because of all the inserts and some related
triplestore designs? What steps should I perform to optimize the store to suit
the scenario?
Thank you very much for your responses.
Jan