Re: Fuseki TDB database size growth

Andy Seaborne Wed, 23 Aug 2017 11:30:44 -0700


On 22/08/17 19:50, Chris Tomlinson wrote:

Hi Andy,

In our present production environment we perform daily full backups with 
multiple incrementals during the day and would expect to do similar with a 
Jena-based system.

We are accustomed to running the primary db for many months at a time without 
restarts, and with no space consumption other than for new content.

The backups are compressed master files of each resource which are replicated 
to various sites for archiving.

We have steady low levels of create activity and somewhat less update activity. 
Loading our test platforms takes on the order of a couple of hours from scratch 
which is similar to what we see with the XML db so that is a concern only if we 
are having to do such reloads owing to space loss as a consequence of “normal” 
usage.

"create" activity, presumably adding triples, consumes space as you'd 
expect. If the updates are somewhat less, only if an update is much 
bigger in terms of work than the "create" is it going to be a problem. 
Otherwise "create" growth will dominate and that's a necessary thing to 
happen. It is all in the details and ultimately it needs an experiment; 
the initial growth due to undoing the bulkloader's tight packing is at 
most 2x the index size and the node table is often as big (and bigger 
for you if you store those pages).

My questions are trying to get a sense of how we should expect to use 
Jena/TDB/Fuseki. I was thinking to replace the current native XML db with Jena 
and we have explored some aspects but not nearly enough to understand the best 
practices with Jena.

After reading the comment from Rob regarding the no GC I had thought of a 
compaction tool and was going to inquire about such before I saw your reply. 
Now I want to ask about the status of TDB2. I see that it is at 0.3.0-SNAPSHOT 
aligned with Jena 3.4.0 and am wanting to know about its status as far as 
possible inclusion into Jena.

The project needs to have a discussion - any open source that isn't 
dormant has the problem that |wants| > |resources|, sometimes very >>.

Taking on a new subsystem is a not insignificant step in terms of 
commitment to the long term, answering questions, etc.

That's where the user community can help with testing and contributions, 
as well as all the participation on users@.

And contributions. Thank you for your contributions. Contribution from 
people outside the PMC is great to have.

I can't commit to a timescale realistically except to say it's 
progressing. Not my $job at the moment.

I was also not clear on the answer to my question regarding whether deleting a named graph reclaims any space in the TDB1 node table - I think you’re saying it does not.


correct.

If so that seems to say that with TDB1 the best practice is to view Jena/TDB as 
a create and read system. With TDB2, online compaction permits CRUD operation 
so long as the rate of UD is not too high.

Are reads locked out during online compaction in TDB2?


No - reads continue on the latest current version. Writing is blocked.

In the future, even stopping writers can be relaxed by capturing a 
change and then playing it onto the compacted database. So writers are 
held up just as long as replay takes. Not in the first version though.

That design relies on the changes being logged in rdf-delta, a separate 
piece of work though one that is part of my $job where we keep multiple 
copies in near-realtime consistency. HA copies of TDB.


    Andy


Regards,
Chris

On Aug 22, 2017, at 7:44 AM, Andy Seaborne <[email protected]> wrote:

There are several different things going on causing the DB to grow: Rob has 
mentioned all of them:

1/ No GC of the node table.
2/ Partial reuse of space in indexes [*].
3/ Bulk-loaded databases are tight-packed and fragment when they are 
subsequently updated.

[*] Freed blocks in an index are reused within a transaction only.  One HTTP 
request is one transaction so PUT will reuse the space, delete then add will not.
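
To make the footnote concrete, here is a minimal sketch of the two update 
styles expressed as SPARQL Graph Store Protocol requests. The endpoint URL, 
the dataset name "ds", the graph IRI and the helper names are assumptions for 
illustration, not taken from this thread. The requests are only built, not 
sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed values for illustration only (not from this thread).
GSP_ENDPOINT = "http://localhost:3030/ds/data"
GRAPH_IRI = "http://example.org/graph/page1"


def graph_url(endpoint, graph_iri):
    """Build the Graph Store Protocol URL that addresses one named graph."""
    return endpoint + "?" + urlencode({"graph": graph_iri})


def replace_in_one_request(endpoint, graph_iri, turtle):
    """One PUT: old contents dropped and new contents added in a single request."""
    return [
        Request(
            graph_url(endpoint, graph_iri),
            data=turtle.encode("utf-8"),
            headers={"Content-Type": "text/turtle"},
            method="PUT",
        )
    ]


def replace_in_two_requests(endpoint, graph_iri, turtle):
    """DELETE then POST: two separate requests, hence two transactions."""
    url = graph_url(endpoint, graph_iri)
    return [
        Request(url, method="DELETE"),
        Request(
            url,
            data=turtle.encode("utf-8"),
            headers={"Content-Type": "text/turtle"},
            method="POST",
        ),
    ]
```

Each returned Request could be passed to urllib.request.urlopen against a 
running server. Per the note above, only the single-PUT form keeps the delete 
and the add inside one transaction.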

Blank nodes, or any other kind of RDF term, in the node table are not garbage 
collected away.

In TDB2 there is support for live compaction of a database.  (I got the 
machinery working last weekend :-)  c.f. VACUUM in PostgreSQL or OPTIMIZE TABLE 
in MySQL - both reclaim space.  TDB2 is more like a live copy of the current 
state, not an in-place change at the moment. It is more important to compact in 
TDB2 than TDB1 because, for robustness and performance reasons, the indexes are 
copy-on-first-write in a transaction.  [Odd side effect - the state of the 
database at any point in time is still there in the files, until you compact 
it.]
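
The "live copy of the current state" idea can be sketched with a toy model. 
This is not TDB2's implementation; the list-based node table, the id-tuple 
triples and the function name are all assumptions for illustration. Only terms 
still referenced by a triple are copied into the new structures, and the old 
ones are left untouched until the caller switches over.

```python
def compact(node_table, triples):
    """Copy the live state into fresh structures (toy model).

    node_table: list of RDF terms, where the list index is the node id.
    triples:    list of (s, p, o) tuples of node ids.
    Returns a new node table and new triples; the inputs are not modified.
    """
    new_table = []
    remap = {}  # old node id -> new node id
    new_triples = []
    for triple in triples:
        new_triple = []
        for old_id in triple:
            if old_id not in remap:
                # First time this term is seen: copy it into the new table.
                remap[old_id] = len(new_table)
                new_table.append(node_table[old_id])
            new_triple.append(remap[old_id])
        new_triples.append(tuple(new_triple))
    return new_table, new_triples
```

For example, a node table holding two blank nodes that no triple references 
any more comes out of compact() without them, while the original list still 
holds every term.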

The TDB1 (the version in Jena) equivalent is backup-restore.

But everyone backs up anyway, don't they? :-)

For any database, triplestore or SQL or anything, do not put the primary copy 
of your data in the database unless you have an active support contract, and 
then backup anyway (and test the backup).

On 22/08/17 03:22, Chris Tomlinson wrote:

Hi,
This is interesting to know about blank nodes and reference counting. Does the 
comment regarding deleting triples not recovering blank nodes apply if an 
entire named graph which includes some blank nodes is deleted?
If so it seems that in production Jena/TDB is expected to be periodically 
reloaded from scratch or to not use blank nodes very much.


Not delete them in bulk.

In this case is Jena/TDB more aimed at use cases where it perhaps functions 
like an index cache rather than a primary database. Is this accurate? If so 
what sort of primary database systems are typically found coupled with Jena/TDB?


It is not aimed at OLTP-style applications where change is as common as update.

    Andy

Regards,
Chris

On Aug 21, 2017, at 05:28, Rob Vesse <[email protected]> wrote:

All the data structures used in TDB are broadly speaking append only. This 
means that the database will tend to grow in size over time.

Certain ways of using the database can exacerbate this. In your example I would 
guess that you have a lot of blank nodes present in the data?

Each unique blank node generates a unique identifier inside the system and will 
continually expand the node table. TDB does not implement reference counting so 
even if you delete every triple that references a given RDF node it will never 
be removed from the node table.

Similarly, as the indexes are updated they do not reclaim space so the B+Trees 
will continue to grow over time.

Reloading from scratch creates a smaller database because it is able to 
maximally pack the data into the data structures on disk and you do not have 
any unused identifiers allocated.
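
The effect described above can be illustrated with a toy model. This is not 
TDB's on-disk format; the class names, the example IRIs and the "nightly 
rewrite" workload are assumptions for illustration. The node table only ever 
appends, so rewriting the same number of triples with fresh blank nodes keeps 
the triple count flat while the node table keeps growing.

```python
class ToyNodeTable:
    """Append-only term -> id table with no reference counting (toy model)."""

    def __init__(self):
        self._ids = {}

    def intern(self, term):
        """Return the id for a term, allocating a new id on first sight."""
        if term not in self._ids:
            self._ids[term] = len(self._ids)
        return self._ids[term]

    def __len__(self):
        return len(self._ids)


class ToyStore:
    """Triples stored as id tuples; deleting triples never touches the node table."""

    def __init__(self):
        self.nodes = ToyNodeTable()
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add(tuple(self.nodes.intern(t) for t in (s, p, o)))

    def delete_all(self):
        # Only the triples go away; the ids stay allocated.
        self.triples.clear()


def nightly_rewrite(store, night, pages=100):
    """Delete everything, then reload the same pages with fresh blank nodes."""
    store.delete_all()
    for i in range(pages):
        bnode = f"_:b{night}_{i}"  # each load produces new blank node labels
        store.add(f"<http://example.org/page/{i}>",
                  "<http://example.org/hasValue>", bnode)
```

Running nightly_rewrite() for several nights leaves len(store.triples) at the 
number of pages, while len(store.nodes) grows by one entry per page per night.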

Rob

On 21/08/2017 11:20, "Lorenzo Manzoni" <[email protected]> wrote:

    Hi,

        I'm writing to you because we see a behavior of Fuseki TDB that we
    cannot understand:

    */the fuseki database filesystem size continues to grow even if the
    number of triples does not increase substantially./*

    We are using the latest version of Fuseki (3.4.0) as the triple store of a
    Semantic MediaWiki (MW 1.24, SMW 2.1.1), and every night we have a
    scheduled job that updates the wiki pages and executes maintenance
    scripts (e.g.
    https://www.semantic-mediawiki.org/wiki/Help:Maintenance_script_%22rebuildData.php%22).
    These scripts update the semantic data on the wiki and the triples on
    Fuseki. Basically every triple is rewritten.

    We have observed that the Fuseki database filesystem size grew over time
    to 20 GB, but when we recreate it from scratch the database size is only
    500 MB.

    After that, the Fuseki database grows by about 200 MB every day while the
    number of triples does not change substantially.

    I originally assumed that the rebuild data script was the problem but
    when I executed it alone the fuseki database space did not increase.

    We are running Fuseki on a 64-bit Red Hat machine.

    Can someone help us?

    Thanks in advance,

    Lorenzo
