Hello Andy,

It’s TDB2. See this graph: https://ibb.co/SPScf9V

By the end of the day it's up to twice the size, even when compacting daily.
So, as I understand it, this is an implementation detail, right?

Jan

On 2020/04/24 20:27:29, Andy Seaborne <[email protected]> wrote: 
> Hi Jan,
> 
> You don't say whether this is TDB1 or TDB2 - they behave differently; 
> both grow, though to varying degrees, TDB1 somewhat less quickly than 
> TDB2.  What's more with small databases like 40M,
> 
> TDB2 has a compaction function which, in effect, does the dump/restore 
> but faster and with no downtime for read operations.  Unfortunately this 
> isn't available in Fuseki so a stop-compact-restart is necessary (it can 
> be done - the code isn't written).
> 
> TDB2 uses an MVCC data structure - when you add data, the new data is 
> added to copied blocks in the index. This has the advantage of allowing 
> arbitrarily large transactions, including bulk loads, while running. The 
> loads are faster as well because some data is written to disk async by 
> the OS while the update is in progress.  Outstanding read transactions 
> continue reading the old data.  Indeed, for TDB2, it would be possible to 
> run a query on any previous state of the database - it never forgets 
> until a compaction happens.
> 
> TDB1 grows but more slowly. It does not always reuse index blocks freed 
> up when the B+Tree blocks are split.  TDB1 does not finish and write 
> back transactions until after the transaction has finished which limits 
> the transaction size.
> 
> At the moment, for both cases, offline repacking is necessary.
> 
>      Andy
> 
> On 24/04/2020 11:07, Jan Šmucr wrote:
> > Hello.
> > 
> > I'm building a file processing workflow monitoring system based on Jena 
> > Fuseki. The goal for this system is to be purely additive. Individual 
> > events are pieces of knowledge about each of the jobs, eventually 
> > connected via various identifiers and references. Finally I can search for 
> > an event and with basic knowledge of the scheme I can rebuild a graph 
> > representing the whole processing job, and display it to the customer. It 
> > seems to work and I'm happy about the whole idea.
> > 
> > There's however one thing I'd like to solve and that is the incredible 
> > amount of space the triplestore consumes. Currently there's about 40M 
> > triples (2 months of traffic approximately) and if unmaintained, the amount 
> > of disk space the database consumes is huge. The maintenance process is to 
> > stop Fuseki -> dump database -> backup -> delete the old database -> load 
> > the dump -> start Fuseki. At this point the database is at most 10 % of 
> > what it was before the maintenance. Then it grows back and even more with 
> > new data being added.
> > 
> > Note that approx. half of triples in the inserts might already be in the 
> > database. Example:
> > 
> > ### First event
> > 
> > e:MyReceiveEvent
> >      a e:Event ;
> >      a e:Receive ;
> >      e:subjectMessage e:MyMessage .
> > e:MyMessage
> >      a e:Message ;
> >      e:guid "MyMessage"^^xsd:string .
> > 
> > ### Second event
> > 
> > e:MySendEvent
> >      a e:Event ;
> >      a e:Send ;
> >      e:subjectMessage e:MyMessage .
> > e:MyMessage
> >      a e:Message ;
> >      e:guid "MyMessage"^^xsd:string .
> > 
> > This is because I need to use inserts only and I don't know anything about 
> > the triplestore contents at the point I emit the event. My testing however 
> > didn't show any database growth when I did this on purpose.
> > 
> > How can I fight this? Is it caused by all the inserts and some related 
> > aspect of the triplestore's design? What steps should I take to 
> > optimize the store for this scenario?
> > 
> > Thank you very much for your responses.
> > 
> > Jan
> > 
> > 
> 
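
As a rough illustration of the copy-on-write growth Andy describes for TDB2, here is a toy Python model. It is not Jena code: the class name `CopyOnWriteStore`, the block size, and the integer keys are all invented for this sketch, and real TDB2 indexes are B+Trees with more involved copying. The only point is that many small write transactions leave behind many old block versions until a compaction drops them.

```python
class CopyOnWriteStore:
    """Toy append-only store: a commit copies each block it changes."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.live = {}           # block id -> tuple of keys (current version)
        self.blocks_on_disk = 0  # every block version written since last compaction

    def commit(self, keys):
        """Add keys in one write transaction.

        Changed blocks are written as new copies; old copies stay on disk.
        Keys that are already present change nothing in this toy model.
        """
        touched = {}
        for key in keys:
            block_id = key // self.block_size
            current = touched.get(block_id, self.live.get(block_id, ()))
            if key not in current:
                touched[block_id] = tuple(sorted(current + (key,)))
        self.live.update(touched)
        self.blocks_on_disk += len(touched)

    def compact(self):
        """Keep only the current version of each block."""
        self.blocks_on_disk = len(self.live)


# Eight one-key transactions versus one eight-key transaction.
many_small = CopyOnWriteStore(block_size=4)
for k in range(8):
    many_small.commit([k])

one_bulk = CopyOnWriteStore(block_size=4)
one_bulk.commit(range(8))
```

In this model the eight small commits leave eight block versions on disk although only two blocks are live, while the single bulk commit writes just two. Calling `many_small.compact()` brings the first store back down to two, which mirrors the shape (not the numbers) of the growth and shrinkage discussed in the thread.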
