Hello Andy,

It’s TDB2. See this graph: https://ibb.co/SPScf9V
At the end of the day it’s up to twice the size if compacting daily. So as I understand, it’s an implementation detail, right?

Jan

On 2020/04/24 20:27:29, Andy Seaborne <[email protected]> wrote:
> Hi Jan,
>
> You don't say whether this is TDB1 or TDB2 - they behave differently,
> both grow though by varying degrees, TDB1 somewhat less quickly than
> TDB2. What's more with small databases like 40M,
>
> TDB2 has a compaction function which, in effect, does the dump/restore
> but faster and with no downtime for read operations. Unfortunately this
> isn't available in Fuseki so a stop-compact-restart is necessary (it can
> be done - the code isn't written).
>
> TDB2 uses an MVCC data structure - when you add data, the new data is
> added to copied blocks in the index. This has the advantage of arbitrary
> large transactions, including bulk loads, while running. The loads are
> faster as well because some data is written to disk async by the OS while
> the update is in-progress. Outstanding read transactions continue reading
> the old data. Indeed, for TDB2, it would be possible to run a query on any
> previous state of the database - it never forgets until a compaction
> happens.
>
> TDB1 grows but more slowly. It does not always reuse index blocks freed
> up when the B+Tree blocks are split. TDB1 does not finish and write
> back transactions until after the transaction has finished which limits
> the transaction size.
>
> At the moment, for both cases, offline repacking is necessary.
>
> Andy
>
> On 24/04/2020 11:07, Jan Šmucr wrote:
> > Hello.
> >
> > I'm building a file processing workflow monitoring system based on Jena
> > Fuseki. The goal for this system is to be purely additive. Individual
> > events are pieces of knowledge about each of the jobs eventually
> > connected via various identifiers and references. Finally I can search for
> > an event and with basic knowledge of the scheme I can rebuild a graph
> > representing the whole processing job, and display it to the customer. It
> > seems to work and I'm happy about the whole idea.
> >
> > There's however one thing I'd like to solve and that is the incredible
> > amount of space the triplestore consumes. Currently there's about 40M
> > triples (2 months of traffic approximately) and if unmaintained, the amount
> > of disk space the database consumes is huge. The maintenance process is to
> > stop Fuseki -> dump database -> backup -> delete the old database -> load
> > the dump -> start Fuseki. At this point the database is at most 10 % of
> > what it was before the maintenance. Then it grows back and even more with
> > new data being added.
> >
> > Note that approx. half of triples in the inserts might already be in the
> > database. Example:
> >
> > ### First event
> >
> > e:MyReceiveEvent
> >     a e:Event ;
> >     a e:Receive ;
> >     e:subjectMessage e:MyMessage .
> > e:MyMessage
> >     a e:Message ;
> >     e:guid "MyMessage"^^xsd:string .
> >
> > ### Second event
> >
> > e:MySendEvent
> >     a e:Event ;
> >     a e:Send ;
> >     e:subjectMessage e:MyMessage .
> > e:MyMessage
> >     a e:Message ;
> >     e:guid "MyMessage"^^xsd:string .
> >
> > This is because I need to use inserts only and I don't know anything about
> > the triplestore contents at the point I emit the event. My testing however
> > didn't show any database growth when I did this on purpose.
> >
> > How to fight this? Is it because of all the inserts and some related
> > triplestore designs? What steps should I perform to optimize the store to
> > suit the scenario?
> >
> > Thank you very much for your responses.
> >
> > Jan
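
The growth pattern Andy describes (new data goes into copied index blocks, old blocks stay until compaction) can be illustrated with a toy model. This is only a rough sketch and not how TDB2 is actually implemented: the class CopyOnWriteStore, its block_size parameter and the "one insert = one block write" simplification are all made up for illustration. Skipping the write when the key is already present is likewise an assumption, chosen to mirror the observation that re-inserting existing triples did not grow the database.

```python
class CopyOnWriteStore:
    """Toy append-only store (hypothetical, not TDB2 code).

    An insert never modifies a block in place. It appends a new copy of the
    affected block, and the old copy stays in the "file" until compact().
    """

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.blocks = []  # every block version ever written
        self.live = {}    # block id -> index of its current version

    def insert(self, key):
        block_id = key // self.block_size
        if block_id in self.live:
            current = self.blocks[self.live[block_id]]
        else:
            current = frozenset()
        if key in current:
            return  # assumption: data already present causes no write
        # Append a new version; the previous version is now garbage.
        self.blocks.append(current | {key})
        self.live[block_id] = len(self.blocks) - 1

    def size(self):
        """Number of blocks on "disk", live or not."""
        return len(self.blocks)

    def keys(self):
        """All keys visible in the current state."""
        return sorted(k for i in self.live.values() for k in self.blocks[i])

    def compact(self):
        """Rewrite the store keeping only the current version of each block."""
        new_blocks, new_live = [], {}
        for block_id, index in sorted(self.live.items()):
            new_live[block_id] = len(new_blocks)
            new_blocks.append(self.blocks[index])
        self.blocks, self.live = new_blocks, new_live


store = CopyOnWriteStore(block_size=4)
for key in range(40):      # 40 small "transactions"
    store.insert(key)
store.insert(5)            # duplicate insert, no new block in this model

size_before = store.size()
store.compact()
size_after = store.size()
print(size_before, size_after)  # 40 10
```

In this model many small inserts leave several dead versions of each block behind, so the store is four times larger than its live content until compact() runs. The real ratio in TDB2 depends on index layout and transaction sizes, which the sketch does not try to reproduce.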
