Several different things are causing the DB to grow, and Rob has
mentioned all of them:
1/ No GC of the node table.
2/ Partial reuse of space in indexes [*].
3/ Bulk-loaded databases are tightly packed and fragment after that
when updated.
[*] Freed blocks in an index are reused within a transaction only. One HTTP
request is one transaction, so a PUT will reuse the space; a delete followed
by an add will not.
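To illustrate the single-request case, here is a minimal sketch that builds a SPARQL Graph Store Protocol PUT replacing one named graph. The endpoint URL, the graph name and the helper name are assumptions for illustration, not taken from the setup discussed in this thread; the request is only built, not sent.

```python
import urllib.parse
import urllib.request

# Hypothetical Fuseki Graph Store Protocol endpoint; adjust to your dataset.
GSP_ENDPOINT = "http://localhost:3030/ds/data"


def build_replace_graph_request(graph_uri, turtle_text, endpoint=GSP_ENDPOINT):
    """Build (but do not send) a PUT that replaces one named graph.

    A PUT is a single HTTP request, hence a single transaction, so the
    removal of the old graph content and the add of the new content
    happen together.
    """
    query = urllib.parse.urlencode({"graph": graph_uri})
    return urllib.request.Request(
        url=f"{endpoint}?{query}",
        data=turtle_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/turtle; charset=utf-8"},
    )


if __name__ == "__main__":
    req = build_replace_graph_request(
        "http://example.org/g1",
        "<http://example.org/s> <http://example.org/p> 'o' .",
    )
    print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen(req) would send it. A DELETE followed by a separate POST would be two requests and so two transactions.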
Blank nodes, or any other kind of RDF term, in the node table are not
garbage collected away.
In TDB2 there is support for live compaction of a database. (I got the
machinery working last weekend :-) ) Cf. VACUUM in PostgreSQL or
OPTIMIZE TABLE in MySQL - both reclaim space. TDB2 is more like a live
copy of the current state, not an in-place change at the moment. It is
more important to compact in TDB2 than TDB1 because, for robustness and
performance reasons, the indexes are copy-on-first-write in a transaction.
[Odd side effect - the state of the database at any point in time is
still there in the files, until you compact it.]
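As a rough illustration of why copy-on-first-write storage grows and why compaction is a copy of the current state, here is a toy model. It is a sketch only: the class and method names are made up, and it does not reflect TDB2's real file layout.

```python
class ToyCowFile:
    """Toy model of a copy-on-first-write block file (not TDB2's real layout)."""

    def __init__(self, initial_blocks):
        self.blocks = list(initial_blocks)            # append-only storage
        self.current = list(range(len(self.blocks)))  # logical slot -> block index
        self.history = [list(self.current)]           # one mapping per committed state

    def write_transaction(self, updates):
        """Apply {slot: new_value}; each touched slot gets a new block, then commit."""
        mapping = list(self.current)
        for slot, value in updates.items():
            self.blocks.append(value)        # never overwrite the old block
            mapping[slot] = len(self.blocks) - 1
        self.current = mapping
        self.history.append(list(mapping))

    def read(self, version=-1):
        """Read the state as of a committed version (default: latest)."""
        return [self.blocks[i] for i in self.history[version]]

    def compact(self):
        """Return a new file holding only the current state."""
        return ToyCowFile(self.read())


if __name__ == "__main__":
    f = ToyCowFile(["a", "b", "c"])
    f.write_transaction({1: "B"})
    f.write_transaction({1: "B2", 2: "C"})
    print(len(f.blocks), f.read(0), f.read())
    print(len(f.compact().blocks))
```

In this model every old state stays readable until compact() produces a fresh file with only the live blocks.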
The TDB1 (the version in Jena) equivalent is backup-restore.
But everyone backs up anyway, don't they? :-)
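One possible shape for a TDB1 backup-restore cycle, sketched with the Jena command line tools tdbdump and tdbloader. The directory and file names are placeholders, and the sketch assumes the tools are on PATH and that nothing else (e.g. Fuseki) has the database open while it runs.

```python
import subprocess


def dump_command(db_dir):
    """argv for dumping a TDB1 database as N-Quads on stdout."""
    return ["tdbdump", f"--loc={db_dir}"]


def load_command(new_db_dir, dump_file):
    """argv for bulk loading the dump into a fresh, empty directory."""
    return ["tdbloader", f"--loc={new_db_dir}", dump_file]


def backup_restore(db_dir, new_db_dir, dump_file):
    """Dump the old database, then rebuild it tightly packed elsewhere.

    Assumption: dump_file ends in .nq so the loader recognises N-Quads.
    """
    with open(dump_file, "wb") as out:
        subprocess.run(dump_command(db_dir), stdout=out, check=True)
    subprocess.run(load_command(new_db_dir, dump_file), check=True)
```

For example, backup_restore("/path/to/DB", "/path/to/DB-new", "backup.nq") would leave the original directory untouched, so it can be kept until the new copy has been checked.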
For any database, triplestore or SQL or anything, do not put the primary
copy of your data in the database unless you have an active support
contract, and then backup anyway (and test the backup).
On 22/08/17 03:22, Chris Tomlinson wrote:
Hi,
This is interesting to know about blank nodes and reference counting. Does the
comment regarding deleting triples not recovering blank nodes apply if an
entire named graph which includes some blank nodes is deleted?
If so it seems that in production Jena/TDB is expected to be periodically
reloaded from scratch or to not use blank nodes very much.
Not delete them in bulk.
In this case, is Jena/TDB more aimed at use cases where it perhaps functions
like an index cache rather than a primary database? Is this accurate? If so,
what sort of primary database systems are typically found coupled with Jena/TDB?
It is not aimed at OLTP-style applications where change is as common as
query.
Andy
Regards,
Chris
On Aug 21, 2017, at 05:28, Rob Vesse <[email protected]> wrote:
All the data structures used in TDB are, broadly speaking, append only. This
means that the database will tend to grow in size over time.
Certain ways of using the database can exacerbate this. In your example I would
guess that you have a lot of blank nodes present in the data?
Each unique blank node generates a unique identifier inside the system and will
continually expand the node table. TDB does not implement reference counting so
even if you delete every triple that references a given RDF node it will never
be removed from the node table.
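The effect can be sketched with a toy model of an append-only node table without reference counting. This is a simplification for illustration, not TDB's actual implementation; the class and function names are made up.

```python
class ToyStore:
    """Toy model: an append-only node table plus a set of triples."""

    def __init__(self):
        self.node_ids = {}    # RDF term -> internal id; entries are never removed
        self.triples = set()  # triples stored as tuples of internal ids

    def _intern(self, term):
        # Allocate a fresh id the first time a term is seen.
        if term not in self.node_ids:
            self.node_ids[term] = len(self.node_ids)
        return self.node_ids[term]

    def add(self, s, p, o):
        self.triples.add((self._intern(s), self._intern(p), self._intern(o)))

    def delete(self, s, p, o):
        # Only the triple goes away; with no reference counting the
        # node table entries stay behind.
        ids = tuple(self.node_ids.get(t) for t in (s, p, o))
        self.triples.discard(ids)


def simulate(nights):
    """Rewrite one logical triple each night with a newly minted blank node."""
    store = ToyStore()
    previous = None
    for night in range(nights):
        bnode = f"_:b{night}"  # each reload creates a new blank node
        if previous is not None:
            store.delete(previous, "ex:p", "ex:o")
        store.add(bnode, "ex:p", "ex:o")
        previous = bnode
    return len(store.triples), len(store.node_ids)


if __name__ == "__main__":
    print(simulate(100))  # (1, 102): one live triple, 102 node table entries
```

The triple count stays flat while the node table keeps growing, which matches the symptom described in the original question.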
Similarly, as the indexes are updated they do not reclaim space, so the B+Trees
will continue to grow over time.
Reloading from scratch creates a smaller database because it is able to
maximally pack the data into the data structures on disk and you do not have
any unused identifiers allocated.
Rob
On 21/08/2017 11:20, "Lorenzo Manzoni" <[email protected]> wrote:
Hi,
I'm writing because we see a behavior of Fuseki TDB that we cannot
understand:
the Fuseki database filesystem size continues to grow even if the
number of triples does not increase substantially.
We are using the latest version of Fuseki (3.4.0) as the triple store of a
Semantic MediaWiki (MW 1.24, SMW 2.1.1), and every night a
scheduled job updates the wiki pages and executes maintenance
scripts (e.g.
https://www.semantic-mediawiki.org/wiki/Help:Maintenance_script_%22rebuildData.php%22).
These scripts update the semantic data on the wiki and the triples in
Fuseki. Basically every triple is rewritten.
We have observed that the Fuseki database filesystem size grew over time
to 20 GB, but when we recreate it from scratch the database size is only
500 MB.
After that the Fuseki database grows by about 200 MB every day, while the
number of triples does not change substantially.
I originally assumed that the rebuild data script was the problem, but
when I executed it alone the Fuseki database size did not increase.
We are running Fuseki on a 64-bit Red Hat machine.
Can someone help us?
Thanks in advance,
Lorenzo